When an exponential curve fit is corrupted by noise

I ran into a problem when doing a curve fit with this equation:
y = a*exp(-x/b)
x is fixed: x = [13 26 39 52 65 78 91]. y is the input; a and b are unknowns, and b is the output. I use LSQ estimation to do the curve fitting, and add a constraint on the output b: b should be in the range [0, 1000].
Now the system works like this: when I have an input sequence like
y=[460 434 288 218 164 114 89]
The output is b=51.46, which is good.
If the input sequence is
y=[599 640 592 609 550 588 573 626]
The estimation result is b=1000. This is also good. No problem.
But when I input a pure noise sequence:
y=[24 19 31 5 27 31 17]
The result I get from my curve fitting algorithm is b=1000. In this case the output b is very high, which is not acceptable for the system. I would expect a low value of b here, say b = 0.
I tried to add a threshold on y, say
if y<50 then b=0
But the system is not very stable: the noise level changes from time to time. Is there another way to solve this problem? Thank you in advance.
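For reference, the bounded least-squares setup described above can be sketched with SciPy (the original system's solver is unknown; the initial guess, bound values, and function names here are my assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.array([13, 26, 39, 52, 65, 78, 91], dtype=float)

def model(x, a, b):
    # exponential decay: y = a * exp(-x / b)
    return a * np.exp(-x / b)

def fit_b(y):
    # bounded LSQ: a >= 0, b constrained to (0, 1000] as in the question
    (a, b), _ = curve_fit(model, x, np.asarray(y, dtype=float),
                          p0=[max(y), 50.0],
                          bounds=([0.0, 1e-6], [np.inf, 1000.0]))
    return b

b_signal = fit_b([460, 434, 288, 218, 164, 114, 89])  # close to the b = 51.46 reported above
b_noise = fit_b([24, 19, 31, 5, 27, 31, 17])          # pushed toward the upper bound
```

This reproduces the behaviour described: a pure-noise sequence has no decay trend, so the best fit is a flat line, which drives b toward the upper bound.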

First, note that this category of problem commonly appears in the literature as the logistic growth model. I believe your specific problem should be considered in the context of a mixed model, a statistical model containing both fixed effects and random effects.
More concretely, you might use Matlab's nlmefit from its statistics toolbox.
A bird's-eye view of nlme can be found in this presentation.

Related

Use of Reed-Solomon error correction algorithm with 4-state barcodes

I have combined data that requires a minimum of 35 bits.
Using a 4-state barcode, each bar represents 2 bits, so the above mentioned information can be translated into 18 bars.
I would like to add some strong error correction to this barcode, so if it's somehow damaged, it can be corrected. One of such approach is Reed-Solomon error correction.
My goal is to add as strong error correction as possible, but on the other hand I have a size limitation on the barcode. If I understood the Reed-Solomon algorithm correctly, m∙k  has to be at least the size of my message, i.e. 35 in my case.
Based on the Reed-Solomon Interactive Demo, I can go with (m, n, t, k) being (4, 15, 3, 9), which would allow me to code message up to 4∙9 = 36 bits. This would lead to code word of size 4∙15 = 60 bits, or 30 bars, but the error correction ratio t / n would be just 20.0%.
Next option is to go with (m, n, t, k) being (5, 31, 12, 7), which would allow me to code message up to 5∙7 = 35 bits. This would lead to code word of size 5∙31 = 155 bits, or 78 bars, and the error correction ratio t / n would be ~38.7%.
The first scenario requires use of barcode with 30 bars, which is nice, but 20.0% error correction is not as great as desired. The second scenario offers excellent error correction of 38.7%, but the barcode would have to have 78 bars, which is too many.
Is there some other approach or a different method, that would offer great error correction and a reasonable barcode length?
You could use a shortened code word such as (5, 19, 6, 7): ~31.6% correction ratio, 95 bits, 48 bars. One advantage of a shortened code word is a reduced chance of mis-correction if it is allowed to correct the maximum of 6 errors: if any of the 6 error locations is outside the range of valid locations, that is an indication that there are more than 6 errors. The probability of mis-correction is about (19/31)^6 ≈ 5.3%.
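As a quick sanity check on the numbers above, the parameters of a candidate (m, n, t, k) code and the quoted mis-correction estimate can be reproduced with a short script (the helper name is mine, not from any RS library):

```python
import math

def rs_stats(m, n, t, k):
    """Summarise an RS code over GF(2^m) with n code symbols, k data symbols,
    correcting up to t symbol errors; a 4-state barcode carries 2 bits per bar."""
    return {
        "message_bits": m * k,
        "codeword_bits": m * n,
        "bars": math.ceil(m * n / 2),
        "correction_ratio": t / n,
    }

print(rs_stats(5, 19, 6, 7))
# mis-correction estimate for the shortened code: each of the 6 reported
# error locations must land inside the 19 valid positions out of 31
p_miscorrect = (19 / 31) ** 6
```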

Lossless compression of an ordered series of 29 digits (each 0 to 5 Likert scale)

I have a survey with 29 questions, each with a 5-point Likert scale (0=None of the time; 4=Most of the time). I'd like to compress the total set of responses to a small number of alpha or alphanumeric characters, adding a check digit to the end.
So, the set of responses 00101244231023110242231421211 would get turned into something like A2CR7HW4. This output would be part of a printout that a non-techie user would enter on a website as a shortcut to entering the entire string. I'd want to avoid ambiguous characters, such as 0,O,D,I,l,5,S, leaving me with 21 or 22 characters to use (uppercase only). Alternatively, I could just stick with capital alpha only and use all 26 characters.
I'm thinking of converting each pair of digits to a letter (5^2 = 25, so the whole alphabet is adequate). That would reduce the sequence to 15 characters, which is still longish to type without errors.
Any other suggestions on how to minimize the length of the output?
EDIT: BTW, for context, the survey asks 29 questions about mental health symptoms, generating a predictive risk for 4 psychiatric conditions. I need a code representing all responses.
If the five answers are all equally likely, then the best you can do is ceiling(29 * log(5) / log(n)) symbols, where n is the number of symbols in your alphabet. (The base of the logarithm doesn't matter, so long as they're both the same.)
So for your 22 symbols, the best you can do is 16. For 26 symbols, the best is 15, as you described for 25. If you use 49 characters (e.g. some subset of the upper and lower case characters and the digits), you can get down to 12. The best you'll be able to do with printable ASCII characters would be 11, using 70 of the 94 characters.
The only way to make it smaller would be if the responses are not all equally likely and are heavily skewed. Though if that's the case, then there's probably something wrong with the survey.
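The bound above is quick to compute (the function name is mine):

```python
import math

def min_symbols(alphabet_size, questions=29, choices=5):
    # information-theoretic lower bound on output length:
    # ceil(questions * log(choices) / log(alphabet_size))
    return math.ceil(questions * math.log(choices) / math.log(alphabet_size))

print(min_symbols(22))  # 16
print(min_symbols(26))  # 15
print(min_symbols(70))  # 11
```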
First, choose a set of permissible characters, e.g.
characters = "ABC..."
Then, prefix the input digits with a 1 (so that leading zeros survive the round trip) and interpret the result as a quinary (base-5) number:
100101244231023110242231421211
Now, convert this quinary number to a number in base strlen(characters), i.e. base 26 if 26 characters are to be used:
02 23 18 12 10 24 04 19 00 15 14 20 00 03 17
Then, use these numbers as indexes into "characters", and you have your encoding:
CVSMKWETAPOUADR
For decoding, just reverse the steps.
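A sketch of those steps in Python (arbitrary-precision integers make the base conversion trivial; the plain uppercase alphabet is used as the character set):

```python
CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(responses: str) -> str:
    # prefix "1" so leading zeros in the responses survive,
    # then read the whole string as one base-5 number
    n = int("1" + responses, 5)
    out = []
    while n:
        n, r = divmod(n, len(CHARS))
        out.append(CHARS[r])
    return "".join(reversed(out))

def decode(code: str) -> str:
    n = 0
    for ch in code:
        n = n * len(CHARS) + CHARS.index(ch)
    digits = []
    while n:
        n, r = divmod(n, 5)
        digits.append(str(r))
    return "".join(reversed(digits))[1:]  # drop the leading "1"
```

With 26 characters the 29-digit response string always encodes to 15 characters, matching the bound derived above.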
Are you doing this in a specific language?
If you want to be really thrifty about it you might want to consider encoding the data at bit level.
Since there are only 5 possible answers per question you could do this with only 3 bits:
000
001
010
011
100
Your end result would be a string of bits, at 3 bits per answer, so 87 bits in total, i.e. just under 11 bytes.
EDIT - misread the question slightly, there are 5 possible answers not 4, my mistake.
The only problem now is that for 4 of your 5 answers you're wasting a bit... I wouldn't say you'll benefit much from going to this much trouble, but it's worth considering.
EDIT:
I've been playing about with it and it's difficult to work out a mechanism that allows you to use both 2 and 3 bit values.
Since your output would be an 87-bit binary value you'd need to be able to make the distinction between 2- and 3-bit values when converting back to the original values.
If you're working with a larger number of values there are some methods you could use, like having a reserved bit for each value that can be used to sort of type a value and give it some meaning. But working with so few bits as it is, it's hard to shave anything off.
Your output at 87 bits could be padded out to 128 bits, which would give you four 32-bit values if you wanted to simplify it. This 128-bit value would be like a unique fingerprint representing a specific set of answers. There are many ways you can represent 128 bits.
But in the end, working at the bit level is about as good as it gets when it comes to actual compression and encoding of data... if you can express 5 unique values in fewer than 3 bits I'd be suitably impressed.
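For completeness, the straight 3-bits-per-answer packing described above is easy to sketch (the function names are mine):

```python
def pack(answers):
    # 29 answers, each 0..4, 3 bits apiece -> one 87-bit integer
    bits = 0
    for a in answers:
        bits = (bits << 3) | a
    return bits

def unpack(bits, n=29):
    out = []
    for _ in range(n):
        out.append(bits & 0b111)
        bits >>= 3
    return out[::-1]
```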

How to calculate each number's frequency?

There is a big data file whose format is:
111111 11 22 33 44 55 66 77
222222 21 22 23 29 99 98 00
...... ..
How can I use Prolog to calculate each number's frequency?
Sincerely!
You have two problems: Parsing the file and calculating the frequencies.
For parsing the file, I recommend using library(pio). In that manner you can use DCGs to process the file, so I'd recommend you first learn about DCGs. They are Prolog's way to describe, generate, and parse text. They are even more general than that, but to start with, just see them that way.
This you can then combine with calculating the frequencies. To make this also efficient for very large data see this question.

nonlinear map from one range to another

I have a maths problem I am somewhat stumped on. I need to map numbers from one range to another in a nonlinear fashion. I have manually taken some sample data from what I am trying to achieve, which looks like this:
source - desired result
0 - 1
78 - 0.885
363 - 0.625
1429 - 0.3
3404 - 0.155
7524 - 0.075
11604 - 0.05
The source number ranges from 0 to, ideally an infinite number, but happy if it stops somewhere in the 10s of thousands. The resultant number is from 1 to 0. It needs to drop off quickly then level off. Ideally never reaching zero.
I am aware of the standard equation to map from one range to another.
y = ((x * origRange) / newRange) + newRangeOffset
Unfortunately this does not give me the desired results. Is there an elegant nonlinear equation that would give me the results I am after?
f(x) = 620 / (620 + x)
gives an answer accurate to within 2% for all of your values.
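That claim is easy to check numerically against the sample points:

```python
data = [(0, 1.0), (78, 0.885), (363, 0.625), (1429, 0.3),
        (3404, 0.155), (7524, 0.075), (11604, 0.05)]

def f(x):
    return 620 / (620 + x)

# largest relative error over all the sample points
worst = max(abs(f(x) - y) / y for x, y in data)
print(f"worst relative error: {worst:.2%}")  # under 2%
```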
As suggested here, you can use a polynomial interpolation (present in multiple software packages).
If you want to try it, I suggest you go to Wolfram Alpha and select Polynomial Interpolation.
This is one example using some of your points.

Simple data structure for the Othello board game?

I wrote my program ages ago as a uni project; at least it works to some extent (you may try the Monkey and Novice levels :) ).
I'd like to redesign and re-implement it, to practice data structures and algorithms.
In my previous project, minimax search and alpha-beta pruning were the missing parts, as well as an opening dictionary.
Because the game board is symmetric both horizontally and vertically, I need a better data structure than my previous approach:
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 11 12 13 14 15 16 17 18 -1
-1 21 22 23 24 25 26 27 28 -1
-1 31 32 33 34 35 36 37 38 -1
. . . . . .
In this way, one can easily calculate the positions adjacent to any cell value x like this:
x-11 x-10 x-9
x-1 x x+1
x+9 x+10 x+11
Those -1s are acting like "walls" to prevent wrong calculation.
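For reference, the padded-board scheme above looks like this in Python (index = row*10 + column, sentinel -1 in the border cells; the function names are mine):

```python
def make_board():
    # 10x10 array; the outer ring stays -1 ("wall"), the inner 8x8 is playable
    board = [-1] * 100
    for row in range(1, 9):
        for col in range(1, 9):
            board[row * 10 + col] = 0  # 0 = empty square
    return board

def neighbours(x):
    # the eight adjacent indices, using the offsets from the question
    return [x + d for d in (-11, -10, -9, -1, 1, 9, 10, 11)]
```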
The biggest issue is that it doesn't take symmetry/orientation into account; i.e., the same opening, such as the parallel opening, would have 4 corresponding cases in the database, one for each orientation.
Any good suggestions? I am also considering trying Ruby to get quicker calculation speed than PHP (just for minimax alpha-beta pruning, in case I program it to look n steps ahead).
Many thanks for the suggestions in advance.
When you hash a position to store or look up in your database, take the hashes of all eight symmetric positions, and store or look up only the smallest of the eight. Thus all symmetric positions hash to the same value.
This reduces the size of your database by 8 but multiplies the cost of hashing by 8. Is this a good trade-off? It depends on how big your database is and how often you do database lookups.
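One way to implement "use only the smallest of the eight symmetric positions" is to compute a canonical form first. A sketch, assuming the board is an 8x8 tuple of tuples (the helper names are mine):

```python
def rot90(board):
    # rotate an 8x8 tuple-of-tuples a quarter turn
    return tuple(zip(*board[::-1]))

def mirror(board):
    return tuple(tuple(reversed(row)) for row in board)

def canonical(board):
    # smallest of the 8 symmetries; hash/store this instead of the raw board
    forms = []
    cur = board
    for _ in range(4):
        forms.append(cur)
        forms.append(mirror(cur))
        cur = rot90(cur)
    return min(forms)
```

Any two boards that are rotations or reflections of each other now map to the same canonical form, so the opening database only needs one entry per equivalence class.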
After you move to C/C++ :-) consider representing the game board as "bit-boards", e.g. two 64-bit vectors, one for white and one for black: struct Board { uint64_t white, black; };
With care you can then avoid array indexing to test piece positions, and in fact can search in parallel for all up-captures, up-right-captures, etc. from a position using a series of bit logical operators, shifts, and masks, and no loops (!). Much faster.
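The bitboard idea can be sketched even in Python, using a 64-bit integer per colour and an edge mask to stop shifts wrapping across files. This only handles the east direction; the other seven directions are analogous (names and bit layout are my assumptions):

```python
FULL = (1 << 64) - 1
NOT_A_FILE = 0xfefefefefefefefe  # clears bits that wrapped onto the a-file

def shift_east(bb):
    return (bb << 1) & NOT_A_FILE & FULL

def east_moves(own, opp, empty):
    # flood along runs of opponent discs east of our discs;
    # a legal move is the empty square just past such a run
    run = shift_east(own) & opp
    for _ in range(5):  # a capturable run is at most 6 discs on an 8x8 board
        run |= shift_east(run) & opp
    return shift_east(run) & empty

# standard opening position (bit index = rank*8 + file)
white = (1 << 27) | (1 << 36)  # d4, e5
black = (1 << 28) | (1 << 35)  # e4, d5
empty = FULL & ~(white | black)
```

For the opening position, black's only eastward move is f5 (bit 37), found without any loops over board squares.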
This representation idea is orthogonal to your question of opening-book symmetries though.
Happy hacking.
The problem is easy to deal with if you separate the presentation of the board from the internal representation. Once the opening move is made, you get a parallel, diagonal, or perpendicular opening. Each one of them can be in any of the 4 orientations. Rotate the internal board representation until it is aligned with your opening book, then simply take the rotation into account when drawing the board.
In regard to play, you need to look into mobility theory. Take a look at Hugo Calendar's book on the topic. Michael Buro has also written a bit about his program Logistello, and there is a FAQ.
As that parallel opening only applies for the very first move, I would just make the first move fixed.
If you really want speed, I'd recommend C++.
I would also imagine that checking whether a space is on the board is faster than checking whether the space contains a -1.
