Explain Apache Solr boost function - boost

I'm trying to implement logic in Apache Solr so that documents older than 2 years get a penalty based on the difference in days or months.
I am using this boost function, which I found after a lot of googling:
recip(ms(NOW,publicationDate),3.16e-11,1,1) // currently it is set to use 1 year
Can anyone please confirm whether this penalizes old documents?
Thanks

A reciprocal function, recip(x,m,a,b), implements a/(m*x+b), where m, a and b are constants and x is any numeric field or arbitrarily complex function.
With your parameters, the function looks like this:
f(x) = 1 / (3.16e-11*x + 1)
The ms function returns the difference between its arguments in milliseconds.
Dates are relative to the Unix or POSIX time epoch, midnight, January
1, 1970 UTC.
Imagine your publication date is September 1st, 2015: ms sees NOW = 1507725936061 and publicationDate = 1441065600000, and the whole expression evaluates to roughly 0.3, which becomes the score for this document.
For a publication date of yesterday we get a score of about 0.99, which shows that this formula penalizes every document, not only those older than 2 years. For example, the same day 1 year ago scores about 0.5.
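As a rough sanity check (plain Python, not Solr; the dates are just illustrative), the same formula gives:

from datetime import datetime, timezone

def recip_score(now, publication_date, m=3.16e-11, a=1, b=1):
    # a / (m*x + b), with x = ms(NOW, publicationDate)
    x = (now - publication_date).total_seconds() * 1000
    return a / (m * x + b)

now = datetime(2017, 10, 11, tzinfo=timezone.utc)
for label, pub in [("yesterday", datetime(2017, 10, 10, tzinfo=timezone.utc)),
                   ("1 year ago", datetime(2016, 10, 11, tzinfo=timezone.utc)),
                   ("2 years ago", datetime(2015, 10, 11, tzinfo=timezone.utc))]:
    print(label, round(recip_score(now, pub), 2))   # ~1.0, ~0.5, ~0.33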
One thing I could think of is sorting by this function (possible starting from Solr 6):
if(gt(ms(mydatefield,NOW-2YEARS),0),1,recip(ms(NOW,publicationDate),3.16e-11,1,1))
I didn't test it (I'm not sure about the NOW-2YEARS part), but basically it does this:
if ms(mydatefield, NOW-2YEARS) is greater than 0 (the document is newer than 2 years) => the score is 1.0
else => the reciprocal function is calculated
One last remark: there are about 3.16e10 milliseconds in a year, so multiplying by the inverse, 3.16e-11, scales a date difference to fractions of a year; for a 2-year horizon you may want to pick a different constant.

Related

Unknown timestamp reference date

I'm currently dealing with a system which uses an unknown timestamp mechanism.
The system is running on a Windows machine, so my first thought was that it uses some kind of Windows epoch for its timestamps, but it appears it does not.
My goal is to convert these timestamps to Unix timestamps.
A few examples:
The following timestamp: 2111441659 converts to: 2013-10-01 11:59
2111441998 to 2013-10-01 17:14
2111443876 to 2013-10-02 14:36
2111444089 to 2013-10-02 17:57
(All dates are GMT+2)
I've tried to calculate the reference date using the data above, but somehow I get a different result with every single timestamp.
Could anybody shed some light on this rather odd problem?
Thanks in advance!
To me the number seems too small to be milliseconds. My first guess was then seconds, but looking at how quickly this number changes, I think minutes is a better guess. Doing some math on it: 2111441659 / 60 / 24 / 365 = 4017.2, which suggests the epoch might be sometime around the year -2000?
Here is a list of common epochs in computing, but the year -2000 is not really there :) How are you obtaining this timestamp?
P.S. Are you sure the year is set to 2013 on this machine and not to 4013? :) That would then fit with the .NET epoch of January 1, year 1.
In order to distinguish your timestamp from Unix timestamp, let's call yours The Counter.
So we have four counter values with their corresponding DateTime value. The first thing to do is calculate the counter's unit correspondence to a real time unit, let's say a second.
In order to do that, we need (1) the difference d between two counter values and (2) the difference s between their corresponding DateTimes, in seconds.
Considering the first two values we have d1=2111441998-2111441659=339. The difference between 2013-10-01 11:59 and 2013-10-01 17:14 (in seconds) is s1=18900. Consequently, the counter's unit corresponds to u1=s1/d1=55.7522123894 seconds.
But if we do the same with pairs #2 and #3, we will find that u2=40.9584664536 seconds.
Similarly, pairs #3 and #4 give us u3=56.6197183114 seconds.
My conclusion therefore, is that there's no alignment between the counter values and the corresponding DateTimes provided. That's the reason why you get a different result with each sample.
Finally, after many hours of comparing the timestamps with the datetimes and trying to discover the logic behind them, I found the answer by reverse-engineering the software which generates the timestamps.
It turns out that the integer timestamps are actually bitwise representations* of the datetimes.
In pseudocode:
year = TimeStamp >> 20;
month = (TimeStamp >> 16) & 15;
day = (TimeStamp >> 11) & 31;
hour = (TimeStamp >> 6) & 31;
minute = TimeStamp & 63;
*I'm not sure if this is the correct term for it, if not, please correct me.
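For reference, a small Python sketch of the same unpacking, which reproduces the sample values from the question:

def decode_timestamp(ts):
    # Unpack the bit fields described above (year|month|day|hour|minute).
    year   = ts >> 20
    month  = (ts >> 16) & 15
    day    = (ts >> 11) & 31
    hour   = (ts >> 6) & 31
    minute = ts & 63
    return year, month, day, hour, minute

print(decode_timestamp(2111441659))   # (2013, 10, 1, 11, 59)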

How to handle recurring times?

First off, I marked this question as language agnostic, but I'm using PHP and MySQL. It shouldn't affect the question itself very much, though.
I'm creating an application which shows the times of certain shows throughout the week. Every show is recurring (on a weekly basis), and a show may air across 2 days, e.g. starting on Sunday at 23:30 and ending on Monday at 00:30. I'm storing the start of the show (day of the week - Monday, Tuesday, ... - never an exact date - plus the time) and its duration. No show ever lasts more than 24 hours.
My problem is validating that newly added shows don't overlap existing ones, especially when it comes to Sunday-to-Monday shows.
How are such recurring events usually handled on both DB side and server side?
tl;dr version with stuff I considered
My first idea was to create some custom validation algorithm, but it seemed too cumbersome and complicated. Not that I'd whine about complicated hand-made solutions, but I'm wondering whether there's something more basic that I'm missing.
Another alternative that came to mind was to change the table structure to use datetime (instead of "day of week" and "time") and use a fake, fixed date range to store the data. For example, all Mondays would be set to 5th Jan 1970, and Sundays would use 11th Jan 1970. There would be one exception to this rule: a show which starts on Sunday and ends on Monday would be stored as 12th Jan 1970. This solution would allow more flexible querying of the DB than the original one, and it would also simplify queries for shows which overlap between individual weeks (since we can do the comparison directly in the query). There are some disadvantages to this solution as well (for one, using fake dates might make it confusing).
Both solutions smell of the wrong algorithm to me, and I'd love to hear some opinions from more experienced fellow developers.
Sounds like you could just store the starting minute of each show as an integer number of minutes since the start of the week (10,080 possible values).
Then a show starting at minute $a with duration $dur_a overlaps the start of a show at minute $b if and only if
(10080 + $b - $a) % 10080 < $dur_a
(check the same condition with the roles of the two shows swapped to catch the case where $b starts first).
For example consider a show starting at 11pm Sunday and another starting at 12.30am Monday. Here $a == 10020 and $dur_a == 120 and $b == 30. (10080 + $b - $a) % 10080 == 90. This is less than $dur_a and hence the shows overlap.
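A minimal sketch of that test in Python (checking both directions, as noted above; the shows are the ones from the example):

WEEK = 10080   # minutes per week

def overlaps(a_start, a_dur, b_start, b_dur):
    # True if either show starts while the other one is airing.
    return ((b_start - a_start) % WEEK < a_dur or
            (a_start - b_start) % WEEK < b_dur)

# 11pm Sunday, 120 min long, vs 12:30am Monday, 60 min long
print(overlaps(10020, 120, 30, 60))   # True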
This problem could be simplified by converting the data into a format that is amenable to the calculations that are required. I recommend creating a type that represents the start times as the number of minutes from Sunday at midnight. Then simple integer range comparisons could be used to find overlapping shows.
The internal representation must, of course, be hidden and abstracted. You may, at some point, want to change the representation from minutes to seconds, for example.
I would opt for a custom validation algorithm:
For each show, compute all showing intervals [start1, end1], [start2, end2], ... [startN, endN], where N is the number of recurrences of the show.
For a new show, also compute these intervals.
Now check whether any of the new intervals intersects any old interval; this is the case if the start or the end of one interval is contained in the other (see the sketch below).
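One way to make those intervals concrete (my own interpretation, using minutes on a two-week timeline so the Sunday-to-Monday wrap needs no special case):

WEEK = 10080   # minutes per week

def intervals(start, dur):
    # Lay each weekly show out twice (week 0 and week 1) so that overlaps
    # across the week boundary become plain interval intersections.
    return [(start, start + dur), (start + WEEK, start + WEEK + dur)]

def intersects(i1, i2):
    return i1[0] < i2[1] and i2[0] < i1[1]

def conflicts(new_show, old_shows):
    return any(intersects(a, b)
               for a in intervals(*new_show)
               for old in old_shows
               for b in intervals(*old))

# 12:30am Monday (60 min) vs an existing 11pm Sunday show (120 min)
print(conflicts((30, 60), [(10020, 120)]))   # True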

Slight problem with day of the week calculation (base doomsday for a century)

Using the data from this online calculator, http://homer.freeshell.org/dd.cgi, I've successfully written a working version; however, its data is limited to the years 1500 to 2600. I want to modify it (and make a better one) so that I can calculate the day for any year > 2600.
Referring to Table X, is there actually a formula to calculate the base doomsday for all base centuries (above 2600)?
I've tried working it out myself by entering centuries higher than this: e.g. 2700 gave me a base doomsday of '00', 2800 gave '02', and 2900 went back to '00' again...
Help appreciated.
As I understand it, that page's “Base Doomsday” is just an offset to allow for the four-hundred-year cycle of leap day calculations. So, you can extend it indefinitely into the future simply by adding blocks of four centuries.
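To illustrate that cycle, here is a small sketch using the standard Doomsday century-anchor formula (this encodes the same four-century repetition, though not necessarily with the same numbering as that page's table):

def century_anchor(century):
    # Standard Gregorian Doomsday anchor, 0 = Sunday ... 6 = Saturday.
    # It depends only on century % 4, so it repeats every 400 years.
    return (5 * (century % 4) + 2) % 7

for c in range(26, 35):
    print(c * 100, century_anchor(c))   # 2600, 2700, ... repeat with period 4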
Are there any other calculators out there that do this?
Two common methods for calculating the day of the week given a date are Doomsday, which you are using, and Zeller's Congruence.
www.merlyn.demon.co.uk provides some really interesting information on date/time calculations, various calendar systems and significant dates as they relate to calendar/date calculations.
The calculator at this link, http://homer.freeshell.org/dd.cgi, is the best in terms of explaining the doomsday algorithm cleanly and clearly for a human, with one little caveat.
If you input 2/29/1900, it will say it's a Thursday. Well, there is no 2/29/1900, because 1900 is not a leap year.
Of course, if you input 1/35/2016, it will "garbage-in-garbage-out" for you as well.
Imagine there are only 364 days in a year, then the day of week for each date will never change year after year, because mod(364,7)==0.
But we have 365 days a year, so the day steps forward 1 each year; that's where the second term, mod(year, 7), comes from.
In addition, every 4 years there is a leap year, which contributes the last term, mod(year, 4).
But every 100 years, you subtract a leap year, and every 400 years, you add one leap year. That's where the first term "3,2,0,5" comes in.
You see, it's all because of this leap year, and mod(365,7)==1 business.
The mnemonic "I work 9-to-5 at the 7-11" (9/5, 5/9, 7/11, 11/7) helps greatly in remembering table Z.
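For concreteness, a minimal sketch of the standard Doomsday method (the doomsday dates are the usual mnemonic ones; this is a restatement, not the exact tables from that page):

def is_leap(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def doomsday(year):
    # 0 = Sunday ... 6 = Saturday; the century anchor repeats every 400 years.
    anchor = (5 * ((year // 100) % 4) + 2) % 7
    yy = year % 100
    return (anchor + yy + yy // 4) % 7

def day_of_week(year, month, day):
    # Doomsday date in each month: 1/3 (1/4 in leap years), last day of
    # February, 3/14, 4/4, 5/9, 6/6, 7/11, 8/8, 9/5, 10/10, 11/7, 12/12.
    dd = [None,
          4 if is_leap(year) else 3, 29 if is_leap(year) else 28,
          14, 4, 9, 6, 11, 8, 5, 10, 7, 12]
    return (doomsday(year) + day - dd[month]) % 7

print(day_of_week(1776, 7, 4))    # 4 = Thursday
print(day_of_week(2016, 1, 35))   # garbage in, garbage out, as noted above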

Calculating day-of-week in years greater than 9999

I was wondering if there are any algorithms that calculate the day of week in years that are greater than the year 9999.
Algorithms such as Zeller's congruence, or the one here, give false results, since they handle only 4-digit years.
Thank you.
You don't actually need a new algorithm. As long as you have one algorithm with a range of 400 years (or more), you can bring any date inside the range of that algorithm. This works because the Gregorian calendar repeats every 400 years (XX/YY/ZZZZ is the same weekday as XX/YY/(ZZZZ+400)).
So, if we assume that you have some algorithm that works for the dates 1/1/1600 to 31/12/1999 (both inclusive), you can calculate the weekday for any date by using (year mod 400)+1600 as the year.
If you don't have a 400-year range starting on 1/1/XXXX (where XXXX mod 400 = 0), you need to manipulate the date slightly differently to get the right result: instead of adding 1600 to the year, add X*400, where X is an integer chosen so that some of the dates fall in the range, then add or subtract 400 for those dates that are still outside of it.
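For instance, a minimal Python sketch of the same trick (Python's datetime module only supports years 1-9999, so the year is mapped into the 1600-1999 block as described above):

from datetime import date

def weekday_any_year(year, month, day):
    # The Gregorian calendar repeats every 400 years, so an equivalent year
    # in 1600..1999 has the same weekday (and the same leap-year status).
    return date(year % 400 + 1600, month, day).weekday()   # 0 = Monday ... 6 = Sunday

print(weekday_any_year(12345, 1, 1))   # same weekday as 1/1/1945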
http://lxr.linux.no/linux/net/netfilter/xt_time.c for example simply counts it out. To reduce the number of iterations in loops, static tables may be used, as has been done there.

Finding similarities in a multidimensional array

Consider a sales department that sets a sales goal for each day. The total goal isn't important, but the overage or underage is. For example, if Monday of week 1 has a goal of 50 and we sell 60, that day gets a score of +10. On Tuesday, our goal is 48 and we sell 46 for a score of -2. At the end of the week, we score the week like this:
[0,0]=10,[0,1]=-2,[0,2]=1,[0,3]=7,[0,4]=6
In this example, both Monday (0,0) and Thursday and Friday (0,3 and 0,4) are "hot"
If we look at the results from week 2, we see:
[1,0]=-4,[1,1]=2,[1,2]=-1,[1,3]=4,[1,4]=5
For week 2, the end of the week is hot, and Tuesday is warm.
Next, if we compare weeks one and two, we see that the end of the week tends to be better than the first part of the week. So, now let's add weeks 3 and 4:
[0,0]=10,[0,1]=-2,[0,2]=1,[0,3]=7,[0,4]=6
[1,0]=-4,[1,1]=2,[1,2]=-1,[1,3]=4,[1,4]=5
[2,0]=-8,[2,1]=-2,[2,2]=-1,[2,3]=2,[2,4]=3
[3,0]=2,[3,1]=3,[3,2]=4,[3,3]=7,[3,4]=9
From this, we see that the "end of the week is better" theory holds true. But we also see that the end of the month is better than the start. Of course, we would next want to compare this month with the next month, or compare a group of months for quarterly or annual results.
I'm not a math or stats guy, but I'm pretty sure there are algorithms designed for this type of problem. Since I don't have a math background (and don't remember any algebra from my earlier days), where would I look for help? Does this type of "hotspot" logic have a name? Are there formulas or algorithms that can slice and dice and compare multidimensional arrays?
Any help, pointers or advice is appreciated!
This data isn't really multidimensional, it's just a simple time series, and there are many ways to analyse it. I'd suggest you start with the Fourier transform: it detects "rhythms" in a series, so this data would show a spike at a 7-day period and another around thirty days, and if you extended the data set to a few years it would show a one-year spike for seasons and holidays. That should keep you busy for a while, until you're ready to use real multidimensional data, say by adding in weather information, stock market data, results of recent sports events and so on.
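A quick sketch of that idea with numpy (treat it as illustrative only: with just four 5-day weeks, the weekly rhythm shows up as a 5-sample period rather than 7 calendar days):

import numpy as np

# The 20 daily scores from the question, weeks 1..4 concatenated.
scores = np.array([10, -2, 1, 7, 6, -4, 2, -1, 4, 5,
                   -8, -2, -1, 2, 3, 2, 3, 4, 7, 9], dtype=float)

spectrum = np.abs(np.fft.rfft(scores - scores.mean()))
periods = len(scores) / np.arange(1, len(spectrum))   # period in samples per frequency bin
for p, mag in zip(periods, spectrum[1:]):
    print(f"period {p:5.1f} samples, magnitude {mag:6.2f}")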
The following might be relevant to you: Stochastic oscillators in technical analysis, which are used to determine whether a stock has been overbought or oversold.
I'm oversimplifying here, but essentially you have two moving calculations:
14-day stochastic: 100 * (today's closing price - low of last 14 days) / (high of last 14 days - low of last 14 days)
3-day stochastic: same calculation, but relative to 3 days.
The 14-day and 3-day stochastics will have a tendency to follow the same curve. Your stochastics will fall somewhere between 100 and 0; stochastics above 80 are considered overbought or bearish, while below 20 indicates oversold or bullish. More specifically, when your 3-day stochastic "crosses" the 14-day stochastic in one of those regions, you have a predictor of the momentum of the prices.
Although some people consider technical analysis to be voodoo, empirical evidence indicates that it has some predictive power. For what it's worth, a stochastic is a very easy and efficient way to visualize the momentum of prices over time.
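A rough sketch of the %K-style calculation described above (applied here to a generic list of daily values; window sizes as in the answer, and the flat-window case is handled arbitrarily):

def stochastic(values, window):
    # 100 * (latest value - window low) / (window high - window low),
    # computed for each day once a full window is available.
    out = []
    for i in range(window - 1, len(values)):
        lo = min(values[i - window + 1: i + 1])
        hi = max(values[i - window + 1: i + 1])
        out.append(100 * (values[i] - lo) / (hi - lo) if hi != lo else 50.0)
    return out

daily_scores = [10, -2, 1, 7, 6, -4, 2, -1, 4, 5, -8, -2, -1, 2, 3, 2, 3, 4, 7, 9]
print(stochastic(daily_scores, 14))   # the "14-day" line
print(stochastic(daily_scores, 3))    # the "3-day" line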
It seems to me that an OLAP approach (like pivot tables in MS Excel) fits the problem perfectly.
What you want to do is quite simple - you just have to calculate the autocorrelation of your data and look at the correlogram. From the correlogram you can see 'hidden' periods of your data and then you can use this information to analyze the periods.
Here is the result - your numbers and their normalized autocorrelation.
score   autocorrelation
 10      1.000
 -2      0.097
  1     -0.121
  7      0.084
  6      0.098
 -4      0.154
  2     -0.082
 -1     -0.550
  4     -0.341
  5     -0.027
 -8     -0.165
 -2     -0.212
 -1     -0.555
  2     -0.426
  3     -0.279
  2      0.195
  3      0.000
  4     -0.795
  7     -1.000
  9
I used Excel to get the values: put the sequence in column A, add the formula =CORREL($A$1:$A$20;$A1:$A20) to cell B1 and copy it down to B19. If you then add a line diagram, you can nicely see the structure of the data.
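If you'd rather not use Excel, here is a small sketch of the same idea in Python (a plain lagged Pearson correlation, so the exact numbers will differ a little from the shrinking-window CORREL values above):

import numpy as np

# The 20 daily scores from the question, weeks 1..4 concatenated.
scores = np.array([10, -2, 1, 7, 6, -4, 2, -1, 4, 5,
                   -8, -2, -1, 2, 3, 2, 3, 4, 7, 9], dtype=float)

def autocorrelation(x, lag):
    # Pearson correlation of the series with itself shifted by `lag` samples.
    return 1.0 if lag == 0 else np.corrcoef(x[:-lag], x[lag:])[0, 1]

for lag in range(10):
    print(lag, round(autocorrelation(scores, lag), 3))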
You can already make reasonable guesses about the periods of the patterns - you're looking at things like weekly and monthly. To look for weekly patterns, for example, just average all the Mondays together and so on. The same goes for days of the month and for months of the year.
Sure, you could use a complex algorithm to find out that there's a weekly pattern, but you already know to expect that. If you think there really may be patterns buried there that you'd never suspect (there's a strange community of people who use a 5-day week and frequent your business), by all means, use a strong tool -- but if you know what kinds of things to look for, there's really no need.
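A tiny sketch of that "just average the Mondays together" idea, using numpy and the four weeks of scores from the question:

import numpy as np

# Rows are weeks 1..4, columns are the five sales days of each week.
weeks = np.array([[10, -2,  1, 7, 6],
                  [-4,  2, -1, 4, 5],
                  [-8, -2, -1, 2, 3],
                  [ 2,  3,  4, 7, 9]], dtype=float)

print(weeks.mean(axis=0))   # average per weekday: [0.   0.25 0.75 5.   5.75]
print(weeks.mean(axis=1))   # average per week:    [4.4  1.2 -1.2  5. ]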
Daniel has the right idea when he suggests correlation, but I don't think autocorrelation is what you want. Instead I would suggest correlating each week with each other week. Peaks in your correlation--that is, values close to 1--suggest that the values of the weeks resemble each other (i.e. are periodic) for that particular shift.
For example, when you cross-correlate
0 0 1 2 0 0
with
0 0 0 1 1 0
the result (for circular shifts of 0 through 5) would be
2 0 0 0 1 3
The highest value is 3, which corresponds to shifting the second array (circularly) right by 5
0 0 0 1 1 0 --> 0 0 1 1 0 0
and then multiplying component-wise
0 0 1 2 0 0
0 0 1 1 0 0
----------------------
0 + 0 + 1 + 2 + 0 + 0 = 3
Note that you can also create your own "fake" week and cross-correlate it with all your real weeks; the idea is that you are looking for "shapes" in your weekly values that correspond to the shape of the fake week, by looking for peaks in the correlation result.
So if you are interested in finding weeks that are close near the end of the week you could use the "fake" week
-1 -1 -1 -1 1 1
and if you get a high response in the first value of the correlation this means that the real week that you correlated with has roughly this shape.
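Here is a minimal sketch of that circular cross-correlation in Python (shift k means the second array rotated right by k positions), reproducing the numbers above:

import numpy as np

def circular_cross_correlation(a, b):
    # Dot product of a with b rotated right by k, for every shift k.
    a, b = np.asarray(a), np.asarray(b)
    return np.array([np.dot(a, np.roll(b, k)) for k in range(len(a))])

a = [0, 0, 1, 2, 0, 0]
b = [0, 0, 0, 1, 1, 0]
print(circular_cross_correlation(a, b))   # [2 0 0 0 1 3] -> peak at shift 5

# The same idea with a "fake" week that emphasizes the end of the week:
fake = [-1, -1, -1, -1, 1, 1]
print(circular_cross_correlation(a, fake))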
This is probably beyond the scope of what you're looking for, but one technical approach that would give you the ability to do forecasting, look at things like statistical significance, etc., would be ARIMA or similar Box-Jenkins models.

Resources