what are the tradeoffs for setting: params.put(TrainingParameters.ITERATIONS_PARAM, "100"); - opennlp

what are the tradeoffs for setting:
params.put(TrainingParameters.ITERATIONS_PARAM, "100");
What do the settings 10 versus 100 or 1000 actually do?
Thanks

I believe "100" is the default. If I'm right, it stands for the number of iterations through your training data, so the model is generally trained better with 100 iterations than with 10.
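As a rough sketch of where that value is set (the comments describe the usual tradeoff rather than measured results, and CUTOFF_PARAM is shown only as the companion setting):

import opennlp.tools.util.TrainingParameters;

// The iterations value controls how many passes the underlying trainer
// makes over the training data.
TrainingParameters params = new TrainingParameters();

// 10   -> fast, but the model weights may not have converged (underfit)
// 100  -> the usual default; a reasonable balance for most corpora
// 1000 -> much longer training, usually small accuracy gains and a higher
//         risk of overfitting a small training set
params.put(TrainingParameters.ITERATIONS_PARAM, "100");
params.put(TrainingParameters.CUTOFF_PARAM, "5"); // minimum feature frequency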


Estimating time to read an article in ASP MVC

How can I estimate the time needed to read an article, the way the docs.asp.net website does?
At the top of every article, it says you need xxx minutes to read. I think they are using an algorithm to estimate the time.
How can I do that?
Thanks in advance
The average reading speed is about 250-300 words per minute; once you know this you just need to:
Get the article word count.
Divide this number by 275 (more or less).
Round the result to get an integer number of minutes.
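A minimal sketch of those three steps (Java here, since the rest of this page uses Java; whitespace-based word counting and the 275 wpm figure are the assumptions):

// Word count divided by an assumed average reading speed (275 wpm),
// rounded to a whole number of minutes (minimum of 1).
static int estimateReadingMinutes(String articleText) {
    int wordCount = articleText.trim().split("\\s+").length;
    return Math.max(1, Math.round(wordCount / 275.0f));
}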
According to a study conducted in 2012, the average reading speed of an adult for text in English is: 228±30 words, 313±38 syllables, and 987±118 characters per minute.
You can therefore calculate an average time to read a particular article by counting one of these factors and dividing by that average speed. Syllables per minute is probably the most accurate, but for computers, words and characters are easier to count.
Study Citation:
Standardized Assessment of Reading Performance: The New International Reading Speed Texts IReST, by Susanne Trauzettel-Klosinski, Klaus Dietz, and the IReST Study Group, published in Investigative Ophthalmology & Visual Science, August 2012, Vol. 53, 5452-5461.
There's a nice solution here on how to get an estimated read time of any article or blog post: https://stackoverflow.com/a/63820743/12490386
Here's an easy one:
Divide your total word count by 200.
You'll get a decimal number, in this case 4.69. The whole-number part is your minutes; in this case, it's 4.
Take the remaining decimal fraction and multiply it by 60 to get the seconds, rounding up or down as necessary to a whole second. In this case, 0.69 x 60 = 41.4, which we'll round to 41 seconds.
The result? 938 words = a 4 minute, 41 second read.
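As a sketch, the same calculation in code (the 200 wpm figure and the method name are just this example's assumptions):

// 938 words / 200 wpm = 4.69 minutes -> 4 minutes plus 0.69 * 60 ≈ 41 seconds.
static String estimateReadTime(int wordCount) {
    double minutesExact = wordCount / 200.0;                        // 4.69
    int minutes = (int) minutesExact;                               // 4
    int seconds = (int) Math.round((minutesExact - minutes) * 60);  // 41
    return minutes + " minute, " + seconds + " second read";
}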

What factors make training (fit()) extremely slow for a training set of 5,000?

I am running a fit() using a training set of about 5,000 rows, using LogisticRegression as the classifier. I am using CrossValidator and a parameter grid (maybe around 480 combinations in total, assuming each parameter is tried alongside all combinations of the other parameters)
This is running locally ("local[*]" -- so all available cores should be used), and is assigned 12GB of RAM. The training set is small compared to what we will eventually have.
This is running for days -- not what I expected. Can someone provide some tips / explanation of the main areas that might affect this performance?
I'd rather not set up Spark as a cluster unless strictly necessary. I would have thought this was not a monumental task.
An example of the param grid:
return new ParamGridBuilder()
        .addGrid(classifier.regParam(), new double[]{0.0, 0.1, 0.01, .3, .9})
        .addGrid(classifier.fitIntercept())
        .addGrid(classifier.maxIter(), new int[]{10, 20, 100})
        .addGrid(classifier.elasticNetParam(), new double[]{.8, .003})
        .addGrid(classifier.threshold(), new double[]{0.0, .03, 0.5, 1.0})
        .addGrid(classifier.standardization())
        .build(); // 5 * 2 * 3 * 2 * 4 * 2 = 480 parameter combinations
Any suggestions?
Well, 480 models have to be trained and tested. That is a huge amount of work.
I suggest you do some manual exploration first to decide where to do the grid search.
For instance, you could validate whether fitIntercept helps just once (instead of 240 times with one value and 240 times with the other value).
Same thing with standardization.
Some of these parameters are black/white decisions to make.
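For example, a coarser first pass might look like the sketch below (assuming the same LogisticRegression classifier; fitIntercept and standardization are fixed once instead of searched, and threshold is left out of the grid):

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.tuning.ParamGridBuilder;

LogisticRegression classifier = new LogisticRegression()
        .setFitIntercept(true)     // decide once, outside the grid
        .setStandardization(true)  // same: usually a yes/no call
        .setMaxIter(100);

// 3 x 2 = 6 models per fold instead of 480
ParamMap[] coarseGrid = new ParamGridBuilder()
        .addGrid(classifier.regParam(), new double[]{0.0, 0.01, 0.1})
        .addGrid(classifier.elasticNetParam(), new double[]{0.0, 0.5})
        .build();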

Executing genetic algorithm per configuration

I'm trying to understand a paper I'm reading regarding genetic algorithms.
They run a GA there with the parameters they presented.
Some of the parameters are:
stop criterion - 50 generations.
Runs per configuration - 30.
After they presented the parameters they said that they executed the algorithm 20 times.
I don't understand two things:
1. What does the second parameter mean? That every configuration runs until it reaches 30 generations?
2. When they execute the algorithm 20 times, does that mean until they reach 20 generations, or that they executed 20 configurations?
thanks.
Question 1
Usually in GAs, parameters such as mutation rates, selection rates, etc., need to be customised for the problem at hand. To get a reasonably good measure of what is a good or bad parameter configuration, each configuration is repeated for a number of runs, after which the mean and standard deviation of the best solution in each of these runs are computed. Note that these runs have nothing to do with the number of generations or specific parameters in the GA; they are simply a way to reduce noise and increase reliability when comparing and choosing among different parameter configurations.
On the other hand, "number of runs" in itself can also be a parameter of study, but it's quite obvious that increasing the total number of runs will increase the chance to find a really good solution (say a large deviation from mean GA performance). If studying such a case, it's important that the total number of generations (effectively the total number of function evaluations), over all runs, are the same between different parameter settings.
As an example, consider the following parameter configuration:
numberOfRuns = 30
numberOfGenerations = 50
==> a total of 1500 generations analysed
compared against the configuration:
numberOfRuns = 50
numberOfGenerations = 30
==> a total of 1500 generations analysed
A study like this could examine whether it is favourable to have more generations in each GA run (first scenario), or whether it's better to have fewer generations but more runs (second scenario: probably favourable for a GA with quick convergence to local optima but with a large standard deviation).
The following parameter configuration, however, would not have any meaning to benchmark against the two above, since it has a larger number of total generations:
numberOfRuns = 50
numberOfGenerations = 50
==> a total of 2500 generations analysed
Question 2
If they say they executed the algorithm 20 times, that would generally be equivalent to 20 runs as described above. Hence, 20 runs, each running over 50 generations.
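As a sketch of how those two numbers relate in code (runGA below is a hypothetical stand-in for one full GA execution; the 20 and 50 are the paper's values):

import java.util.Random;

public class GaRuns {
    // Hypothetical stand-in for one full GA execution: evolve a fresh random
    // population for `generations` generations and return the best fitness.
    static double runGA(int generations, long seed) {
        Random rng = new Random(seed);
        double best = Double.NEGATIVE_INFINITY;
        for (int g = 0; g < generations; g++) {
            best = Math.max(best, rng.nextDouble()); // placeholder "evolution"
        }
        return best;
    }

    public static void main(String[] args) {
        int numberOfRuns = 20;        // "executed the algorithm 20 times"
        int numberOfGenerations = 50; // "stop criterion - 50 generations"

        double sum = 0, sumSq = 0;
        for (int run = 0; run < numberOfRuns; run++) {
            double best = runGA(numberOfGenerations, run);
            sum += best;
            sumSq += best * best;
        }
        double mean = sum / numberOfRuns;
        double std = Math.sqrt(sumSq / numberOfRuns - mean * mean);
        // The mean and standard deviation over the runs are what get compared
        // between parameter configurations.
        System.out.printf("mean best = %.3f, std = %.3f%n", mean, std);
    }
}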

Fast coincidence search algorithm?

A cry for help to all those good at suggesting fast algorithms, specifically search algorithms!
I'm trying to optimise some MATLAB code which looks for 'coincidences' in a list of chronologically ordered times. A 'coincidence' is defined as two or more times that occur within a given time window of one another. For example, if we have the following times:
100
150
210
220
380
500
520
610
and I wanted to look for 'coincidences' within 100 of each other then the following would be returned [100 150], [210 220], [500 520]. Note that each time can only ever be included in one coincidence event, so [150 210 220] is not valid as a three-way coincidence because 150 has already been used in [100 150].
My times are sorted chronologically, so my current MATLAB code simply scrolls a 'window' of 100 through the list and picks out those times which 'fall in'. This works, and isn't too slow, but I wondered if there is a more efficient solution which I've missed? Surely there are some clever tricks that can be played here?
Any help would be greatly appreciated!
Short answer - no. If I understand your 'window scrolling' correctly, you are going through the list, picking each element as the lower point and checking its bigger neighbours. If a neighbour falls within the 100 range, you add it to the group; if not, you close the group and use the neighbour as the new lower point. Since you only go through each element once that way, you cannot improve on the complexity of your current algorithm, which is already O(n).
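For reference, that single pass looks something like this (a Java sketch of the same greedy grouping; the MATLAB version has the same O(n) structure):

import java.util.ArrayList;
import java.util.List;

// One pass over chronologically sorted times: grow a group while each time is
// within `window` of the group's first (lowest) time, and keep only groups
// with two or more members.
static List<List<Integer>> findCoincidences(int[] sortedTimes, int window) {
    List<List<Integer>> coincidences = new ArrayList<>();
    List<Integer> group = new ArrayList<>();
    for (int t : sortedTimes) {
        if (group.isEmpty() || t - group.get(0) <= window) {
            group.add(t);                       // still within the window
        } else {
            if (group.size() >= 2) coincidences.add(group);
            group = new ArrayList<>();
            group.add(t);                       // t starts the next group
        }
    }
    if (group.size() >= 2) coincidences.add(group);
    return coincidences;
}

// findCoincidences(new int[]{100, 150, 210, 220, 380, 500, 520, 610}, 100)
// -> [[100, 150], [210, 220], [500, 520]]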

How to rank stories based on "controversy"?

I'd like to rank my stories based on "controversy" quotient. For example, reddit.com currently has "controversial" section: http://www.reddit.com/controversial/
When a story has a lot of up votes and a lot of down votes, it's controversial even though the total score may be 0, for example. How should I calculate this quotient score so that when a lot of people are voting both up and down, I can capture that?
Thanks!!!
Nick
I would recommend using the standard deviation of the votes.
A controversial vote that's 100% polarised would have equal numbers of -1 and +1 votes, so the mean would be 0 and the stddev would be around 1.0
Conversely a completely consistent set of votes (with no votes in the opposite direction) would have a mean of 1 or -1 and a stddev of 0.0.
Votes that aren't either completely consistent or completely polarised will produce a standard deviation figure between 0 and ~1.0 where that value will indicate the degree of controversy in the vote.
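Since every vote is +1 or -1, the standard deviation can be computed from the two counts alone; a sketch (the closed form follows from variance = E[x^2] - mean^2, with E[x^2] = 1 for ±1 votes):

// Standard deviation of a set of +1/-1 votes, from counts only:
// mean = (up - down) / n, and since each vote squares to 1,
// variance = 1 - mean^2, so stddev = sqrt(1 - mean^2).
static double controversy(long upVotes, long downVotes) {
    long n = upVotes + downVotes;
    if (n == 0) return 0.0;                   // no votes, no controversy
    double mean = (double) (upVotes - downVotes) / n;
    return Math.sqrt(1.0 - mean * mean);      // 0.0 consistent .. 1.0 evenly split
}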
The easiest method is to count the number of upvote/downvote pairings for a given comment within a timeframe (e.g. 1 week, 48 hours, etc.), and have the comments with the most pairings appear first. Anything more complex requires trial and error to find the best algorithm; as always, it depends on the content of the site and how you want it weighted.
Overall, it's not much different from a hotness algorithm, which works by detecting the most upvotes or views within a timeframe.
What about simply taking the smaller of the two values (up or down) at a point in time? If it goes up a lot and down a little, or the other way around, it is not controversial.
If, for example, the item has 10 ups and 5 downs, the "controversiality level" is 5, since there are 5 people disagreeing about liking it or not. On the other hand, if it has either 10 ups or 10 downs, the "controversiality level" is 0, since no one is disagreeing.
So in the end the smaller of the two counts defines the "hotness" or the "controversiality". Does this make sense?
// figure out whether up or down is winning - it doesn't matter which
if (up_votes > down_votes)
{
    win_votes = up_votes;
    lose_votes = down_votes;
}
else
{
    win_votes = down_votes;
    lose_votes = up_votes;
}
// losewin_ratio is always <= 1, near 0 if win_votes >> lose_votes
// (use floating-point division and guard against stories with no votes)
losewin_ratio = (win_votes == 0) ? 0.0 : (double) lose_votes / win_votes;
total_votes = up_votes + down_votes;
controversy_score = total_votes * losewin_ratio; // large means controversial
This formula will produce high scores for stories that have a lot of votes and a near 50/50 voting split, and low scores for stories that have either few votes or many votes for one choice.
