How to "debug" StanfordNLP text classifier - stanford-nlp

I'm using StanfordNLP to do text classification. I have a training set with two labels, YES and NO, and both labels have roughly the same number of datums (~120K each).
The problem is that StanfordNLP is misclassifying some text, and I'm not able to identify why. How do I debug it?
My training file looks like:
YES guarda-roupa/roupeiro 2 portas de correr
YES guarda-roupa/roupeiro 3 portas
YES guarda roupa , roupeiro 3 portas
YES guarda-roupa 4 portas
YES guarda roupa 6p mdf
YES guardaroupas 3 portas
YES jogo de quarto com guarda-roupa 2 portas + cômoda + berço
YES guarda roupa 4pts
NO base para guarda-sol
NO guarda-sol alumínio
NO guarda chuva transparente
NO coifa guarda po alavanca cambio
NO lancheira guarda do leao vermelha
NO hard boiled: queima roupa
NO roupa nova do imperador
NO suporte para passar roupa
The YES label identifies "guarda roupa" (wardrobe) and NO identifies things that aren't "guarda roupa" but share one or more common words (such as "guarda chuva" -- umbrella, or "roupa" -- clothes).
I don't know why, but my model insists on classifying "guarda roupa" (and its variations such as "guardaroupa", "guarda-roupas", etc.) as NO...
How do I debug it? I already double-checked my training file to see whether I had mislabeled something and introduced an error, but I could not find anything...
Any advice is welcome.
UPDATE 1
I'm using the following properties to control feature creation:
useClassFeature=false
featureMinimumSupport=2
lowercase=true
1.useNGrams=false
1.usePrefixSuffixNGrams=false
1.splitWordsRegexp=\\s+
1.useSplitWordNGrams=true
1.minWordNGramLeng=2
1.maxWordNGramLeng=5
1.useAllSplitWordPairs=true
1.useAllSplitWordTriples=true
goldAnswerColumn=0
displayedColumn=1
intern=true
sigma=1
useQN=true
QNsize=10
tolerance=1e-4
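For reference, a properties file like this is typically exercised through ColumnDataClassifier's command-line interface (the jar and file names below are placeholders, not anything from the question):
java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop my.prop -trainFile train.tsv -testFile test.tsv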
UPDATE 2
Searching the API, I discovered that ColumnDataClassifier has a method getClassifier() that gives access to the underlying LinearClassifier, which has a dump() method. The dump produces output like the sample below. From the API docs: "Print all features in the classifier and the weight that they assign to each class."
YES NO
1-SW#-guarda-roupa-roupeiro-2portas 0,01 -0,01
1-ASWT-guarda-roupa-roupeiro 0,19 -0,19
1-SW#-guarda-roupa-roupeiro 0,19 -0,19
If I call toString() on the LinearClassifier, it prints:
[-0.7, -0.7+0.1): 427.0 [(1-SW#-guarda-roupa-roupeiro-2portas,NO), ...]
[0.6, 0.6+0.1): 427.0 [(1-SW#-guarda-roupa-roupeiro-2portas,YES), ...]
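Roughly how those calls fit together, for reference (a minimal sketch adapted from the ColumnDataClassifierDemo that ships with the classifier; file names are placeholders and exact method signatures may differ between versions):
import edu.stanford.nlp.classify.ColumnDataClassifier;
import edu.stanford.nlp.classify.LinearClassifier;
import edu.stanford.nlp.ling.Datum;
import edu.stanford.nlp.objectbank.ObjectBank;

public class DebugClassifier {
  public static void main(String[] args) {
    // Build the classifier from the properties file shown in UPDATE 1 (placeholder names).
    ColumnDataClassifier cdc = new ColumnDataClassifier("my.prop");
    LinearClassifier<String, String> lc = (LinearClassifier<String, String>)
        cdc.makeClassifier(cdc.readTrainingExamples("train.tsv"));

    // Print every feature with the weight it assigns to YES / NO.
    lc.dump();

    // For individual lines, print the predicted class and the per-feature
    // contributions that drive the decision.
    for (String line : ObjectBank.getLineIterator("check.tsv", "utf-8")) {
      Datum<String, String> d = cdc.makeDatumFromLine(line);
      System.out.println(line + "  ==>  " + lc.classOf(d));
      lc.justificationOf(d);
    }
  }
}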

Related

Allocating and scheduling tasks into rooms with conditions - optimization algorithm

I need to find a suitable method to base an optimization algorithm on, one that does the following:
Let's say we have N tasks to do and M rooms, each of which offers some specific set of infrastructure/conditions.
Each task has to be done in a room with conditions suitable for that task.
For example, to get task A done we need a water tap and gas piping, so we can only use rooms that have those.
Also, for each task we have a predefined due date.
I hope I've explained it well enough.
So, I need to develop an algorithm that allocates the tasks to rooms and schedules them properly, so that all tasks are done in the minimum total time without exceeding the due dates (and if exceeding them is inevitable, returning the least bad answer).
What existing methods or algorithms can I build on and learn from?
I thought about 'Job Shop', but I wonder whether there are other suitable algorithms that can handle problems like this.
This is not an algorithm but a Mixed Integer Programming model. I am not sure if this is what you are looking for.
Assumptions: only one job can execute at a time in a room; jobs in different rooms can execute in parallel. Also, to keep things simple, I assume the problem is feasible (the model will detect infeasible problems, but we don't return a solution in that case).
So we introduce a number of decision variables:
assign(i,j) = 1 if task i is assigned to room j, 0 otherwise
finish(i) = time job i is done processing
makespan = finishing time of the last job
With this we can formulate the MIP model:
The following data is used:
Length(i) = processing time of job i
M = a large enough constant (say the planning horizon)
DueDate(i) = time job i must be finished
Allowed(i,j) = Yes if job i can be executed in room j
Importantly, I assume jobs are ordered by due date.
The first constraint says: if job i runs in room j then it finishes just after the previous jobs running in that room. The second constraint is a bound: a job must finish before its due date. The third constraint says: each job must be assigned to exactly one room where it is allowed to execute. Finally, the makespan is the last finish time.
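For concreteness, here is the formulation implied by that description (reconstructed from it and from the Pyomo translation further down, so treat it as a sketch rather than the original rendering):

minimize    makespan
subject to  finish(i) >= sum(k<=i with Allowed(k,j), Length(k)*assign(k,j)) - M*(1 - assign(i,j))   for all i, j
            finish(i) <= DueDate(i)                                                                 for all i
            sum(j with Allowed(i,j), assign(i,j)) = 1                                               for all i
            makespan >= finish(i)                                                                   for all i
            assign(i,j) in {0,1}, finish(i) >= 0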
To test this, I generated some random data:
---- 37 SET use resource usage
resource1 resource2 resource3 resource4 resource5
task2 YES
task3 YES
task5 YES
task7 YES
task9 YES YES
task11 YES
task12 YES YES
task13 YES
task14 YES
task15 YES
task16 YES YES
task17 YES
task20 YES YES
task21 YES YES
task23 YES
task24 YES
task25 YES YES
task26 YES
task28 YES
---- 37 SET avail resource availability
resource1 resource2 resource3 resource4 resource5
room1 YES YES YES YES
room2 YES YES
room3 YES YES
room4 YES YES YES YES
room5 YES YES YES YES
The set Allowed is calculated from the use(i,r) and avail(j,r) data (room j is allowed for task i exactly when it provides every resource r that task i uses):
---- 41 SET allowed task is allowed to be executed in room
room1 room2 room3 room4 room5
task1 YES YES YES YES YES
task2 YES YES YES YES
task3 YES YES YES YES
task4 YES YES YES YES YES
task5 YES YES YES YES
task6 YES YES YES YES YES
task7 YES YES
task8 YES YES YES YES YES
task9 YES
task10 YES YES YES YES YES
task11 YES YES YES YES
task12 YES
task13 YES YES
task14 YES YES
task15 YES YES YES YES
task16 YES YES YES
task17 YES YES
task18 YES YES YES YES YES
task19 YES YES YES YES YES
task20 YES
task21 YES
task22 YES YES YES YES YES
task23 YES YES
task24 YES YES YES YES
task25 YES YES
task26 YES YES YES YES
task27 YES YES YES YES YES
task28 YES YES YES YES
task29 YES YES YES YES YES
task30 YES YES YES YES YES
We also have random due dates and processing times:
---- 33 PARAMETER length job length
task1 2.335, task2 4.935, task3 4.066, task4 1.440, task5 4.979, task6 3.321, task7 1.666
task8 3.573, task9 2.377, task10 4.649, task11 4.600, task12 1.065, task13 2.475, task14 3.658
task15 3.374, task16 1.138, task17 4.367, task18 4.728, task19 3.032, task20 2.198, task21 2.986
task22 1.180, task23 4.095, task24 3.132, task25 3.987, task26 3.880, task27 3.526, task28 1.460
task29 4.885, task30 3.827
---- 33 PARAMETER due job due dates
task1 5.166, task2 5.333, task3 5.493, task4 5.540, task5 6.226, task6 8.105
task7 8.271, task8 8.556, task9 8.677, task10 8.922, task11 10.184, task12 11.711
task13 11.975, task14 12.814, task15 12.867, task16 14.023, task17 14.200, task18 15.820
task19 15.877, task20 16.156, task21 16.438, task22 16.885, task23 17.033, task24 17.813
task25 21.109, task26 21.713, task27 23.655, task28 23.977, task29 24.014, task30 24.507
When I run this model, I get as results:
---- 129 PARAMETER results
                  start   length   finish  duedate
room1.task1       0.000    2.335    2.335    5.166
room1.task9       2.335    2.377    4.712    8.677
room1.task11      4.712    4.600    9.312   10.184
room1.task20      9.312    2.198   11.510   16.156
room1.task23     11.510    4.095   15.605   17.033
room1.task30     15.605    3.827   19.432   24.507
room2.task6       0.000    3.321    3.321    8.105
room2.task10      3.321    4.649    7.971    8.922
room2.task15      7.971    3.374   11.344   12.867
room2.task24     11.344    3.132   14.476   17.813
room2.task29     14.476    4.885   19.361   24.014
room3.task2       0.000    4.935    4.935    5.333
room3.task8       4.935    3.573    8.508    8.556
room3.task18      8.508    4.728   13.237   15.820
room3.task22     13.237    1.180   14.416   16.885
room3.task27     14.416    3.526   17.943   23.655
room3.task28     17.943    1.460   19.403   23.977
room4.task3       0.000    4.066    4.066    5.493
room4.task4       4.066    1.440    5.506    5.540
room4.task13      5.506    2.475    7.981   11.975
room4.task17      7.981    4.367   12.348   14.200
room4.task21     12.348    2.986   15.335   16.438
room4.task25     15.335    3.987   19.322   21.109
room5.task5       0.000    4.979    4.979    6.226
room5.task7       4.979    1.666    6.645    8.271
room5.task12      6.645    1.065    7.710   11.711
room5.task14      7.710    3.658   11.367   12.814
room5.task16     11.367    1.138   12.506   14.023
room5.task19     12.506    3.032   15.538   15.877
room5.task26     15.538    3.880   19.418   21.713
Detail: based on the assignment, I recalculated the start and finish times. The model can allow some slack here and there as long as it does not interfere with the objective and the due dates. To get rid of any possible slack, I just execute all jobs as early as possible: back-to-back execution of jobs in the same room, using the job ordering (remember I sorted jobs according to due date).
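A hedged sketch of that post-processing step (in Python, with names of my own choosing):
# Recompute start/finish by running each room's jobs back-to-back,
# in due-date order, using the solved assignment.
def recompute_times(assignment, length):
    """assignment: {room: [job, ...]} with jobs already sorted by due date;
    length: {job: processing time}."""
    times = {}
    for room, jobs in assignment.items():
        t = 0.0
        for job in jobs:
            start, t = t, t + length[job]
            times[job] = (room, start, t)  # (room, start, finish)
    return times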
This model with 30 jobs and 10 rooms took 20 seconds using CPLEX; Gurobi was about the same.
Augmenting the model to handle infeasible instances is not very difficult: allow jobs to violate their due date, but at a price, by adding a penalty term to the objective. The due date constraint above is a hard constraint; with this technique we make it a soft constraint.
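In the same notation as above, a hedged sketch of that soft-constraint variant (w is a penalty weight introduced here purely for illustration):

finish(i) <= DueDate(i) + slack(i),   slack(i) >= 0    for all i
minimize   makespan + w * sum(i, slack(i))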
I used a small variant of Alex's OPL CP Optimizer model on the data; it finds the optimal solution (makespan=19.432) within a couple of seconds and proves optimality in about 5s on my laptop. I think a big advantage of a CP Optimizer model is that it would scale to much larger instances and easily produce good-quality solutions, even though for very large instances proving optimality may of course be challenging.
Here is my version of the CP Optimizer model:
using CP;

int N = 30; // Number of tasks
int M = 5;  // Number of rooms
int Length [1..N] = ...; // Task length
int DueDate[1..N] = ...; // Task due date
{int} Rooms[1..N] = ...; // Possible rooms for task

tuple Alloc { int job; int room; }
{Alloc} Allocs = {<i,r> | i in 1..N, r in Rooms[i]};

dvar interval task[i in 1..N] in 0..DueDate[i] size Length[i];
dvar interval alloc[a in Allocs] optional;

minimize max(i in 1..N) endOf(task[i]);
subject to {
  forall(i in 1..N) { alternative(task[i], all(r in Rooms[i]) alloc[<i,r>]); }
  forall(r in 1..M) { noOverlap(all(a in Allocs: a.room==r) alloc[a]); }
}
Note also that the MIP model exploits a problem-specific dominance rule: the tasks allocated to a particular room can be ordered by increasing due date. While this is perfectly true for this simple version of the problem, the assumption may not hold anymore in the presence of additional constraints (such as a minimal start time for the tasks). The CP Optimizer formulation does not make this assumption.
Within CPLEX you can rely on MIP, but you could also use CP Optimizer scheduling.
In OPL, your model would look like:
using CP;

int N = 30; // number of tasks
int M = 10; // number of rooms
range Tasks = 1..N;
range Rooms = 1..M;

int taskDuration[i in Tasks] = rand(20);
int dueDate[i in Tasks] = 20 + rand(20);
// (possible[][] is generated here but not used to restrict the alternatives below)
int possible[j in Tasks][m in Rooms] = (rand(10) >= 8);

dvar interval itvs[j in Tasks][o in Rooms] optional in 0..100 size taskDuration[j];
dvar interval itvs_task[Tasks];
dvar sequence rooms[m in Rooms] in all(j in Tasks) itvs[j][m];

execute {
  cp.param.FailLimit = 10000;
}

minimize max(j in Tasks) endOf(itvs_task[j]);
subject to {
  // alternative: each task is done in exactly one of its room intervals
  forall(t in Tasks) alternative(itvs_task[t], all(m in Rooms) itvs[t][m]);
  // a room handles at most one task at a time
  forall(m in Rooms)
    noOverlap(rooms[m]);
  // due dates
  forall(j in Tasks) endOf(itvs_task[j]) <= dueDate[j];
}
and gives the resulting schedule.
In Pyomo, Erwin's MIP can be implemented like this:
# Data dictionaries (self.resource_usage, self.resource_availability, self.ok,
# self.length, self.due_date) are assumed to be prepared elsewhere in the
# surrounding class.
from pyomo.environ import *

model = ConcreteModel()

################################################################################
# Sets
################################################################################
model.I = Set(initialize=self.resource_usage.keys(), doc='jobs to run')
model.J = Set(initialize=self.resource_availability.keys(), doc='rooms')
model.ok = Set(initialize=self.ok.keys(), doc='allowed (job, room) pairs')
################################################################################
# Params put at model
################################################################################
model.length = Param(model.I, initialize=self.length)
model.due_date = Param(model.I, initialize=self.due_date)
################################################################################
# Var
################################################################################
model.x = Var(model.I, model.J, domain=Boolean, initialize=0, doc='job i is assigned to room j')
model.finish = Var(model.I, domain=NonNegativeReals, initialize=0, doc='finish time of job i')
model.makespan = Var(domain=NonNegativeReals, initialize=0)
################################################################################
# Constraints
################################################################################
M = 100  # big-M constant (e.g. the planning horizon)

def all_jobs_assigned_c(model, i):
    # each job is assigned to exactly one allowed room
    return sum(model.x[ii, jj] for (ii, jj) in model.ok if ii == i) == 1
model.all_jobs_assigned_c = Constraint(model.I, rule=all_jobs_assigned_c)

def finish1_c(model, i, j):
    # if job i runs in room j, it finishes after all earlier jobs in that room
    return sum(
        model.length[ii] * model.x[ii, jj] for (ii, jj) in model.ok if jj == j and ii <= i
    ) - M * (1 - model.x[i, j]) <= model.finish[i]
model.finish1_c = Constraint(model.I, model.J, rule=finish1_c)

model.finish2_c = Constraint(
    model.I, rule=lambda model, i: model.finish[i] <= model.due_date[i]
)
model.makespan_c = Constraint(
    model.I, rule=lambda model, i: model.makespan >= model.finish[i]
)

################################################################################
# Objective
################################################################################
def obj_profit(model):
    return model.makespan
model.objective = Objective(rule=obj_profit, sense=minimize)
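Continuing from the model above, a hedged sketch of how it can be handed to CBC (the solver options are illustrative):
from pyomo.opt import SolverFactory

solver = SolverFactory('cbc')
solver.options['threads'] = 4            # matches the 4-core run mentioned below
results = solver.solve(model, tee=True)  # tee=True echoes the solver log
print(value(model.makespan))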
Solving with CBC on 4 cores took about 2 minutes.

Robust Standard Errors in lm() using stargazer()

I have read a lot about the pain of replicating Stata's easy robust option in R. I replicated the approaches from StackExchange and the Economic Theory Blog. They work, but the problem I face is printing my results using the stargazer function (which produces the .tex code for LaTeX files).
Here is an illustration of my problem:
reg1 <- lm(rev ~ id + source + listed + country, data = data2_rev)
stargazer(reg1)
This prints the R output as .tex code (with non-robust SEs). If I want robust SEs, I can compute them with the sandwich package as follows:
vcov <- vcovHC(reg1, "HC1")
If I now use stargazer(vcov), only the output of the vcovHC function is printed, not the regression output itself.
With the lmtest package it is possible to print at least the estimates, but not the observations, R2, adjusted R2, residual standard error, or the F-statistic:
lmtest::coeftest(reg1, vcov. = sandwich::vcovHC(reg1, type = 'HC1'))
This gives the following output:
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.54923 6.85521 -0.3719 0.710611
id 0.39634 0.12376 3.2026 0.001722 **
source 1.48164 4.20183 0.3526 0.724960
country -4.00398 4.00256 -1.0004 0.319041
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How can I get an output that also includes the following statistics?
Residual standard error: 17.43 on 127 degrees of freedom
Multiple R-squared: 0.09676, Adjusted R-squared: 0.07543
F-statistic: 4.535 on 3 and 127 DF, p-value: 0.00469
Did anybody face the same problem and can help me out?
How can I use robust standard errors in the lm function and apply the stargazer function?
You already calculated robust standard errors, and there's an easy way to include them in the stargazer output:
library("sandwich")
library("plm")
library("stargazer")
data("Produc", package = "plm")
# Regression
model <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc,
index = c("state","year"),
method="pooling")
# Adjust standard errors
cov1 <- vcovHC(model, type = "HC1")
robust_se <- sqrt(diag(cov1))
# Stargazer output (with and without RSE)
stargazer(model, model, type = "text",
se = list(NULL, robust_se))
Solution found here: https://www.jakeruss.com/cheatsheets/stargazer/#robust-standard-errors-replicating-statas-robust-option
Update: I'm not so much into F-tests. People are discussing those issues, e.g. https://stats.stackexchange.com/questions/93787/f-test-formula-under-robust-standard-error
Following http://www3.grips.ac.jp/~yamanota/Lecture_Note_9_Heteroskedasticity :
"A heteroskedasticity-robust t statistic can be obtained by dividing an OLS estimator by its robust standard error (for zero null hypotheses). The usual F-statistic, however, is invalid. Instead, we need to use the heteroskedasticity-robust Wald statistic."
So perhaps use a Wald statistic here?
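If you need a robust overall F-test to go with the table, one option (a hedged sketch, not something from the original answer) is a Wald test of the full model against an intercept-only model, using the same robust covariance matrix:
library(lmtest)
library(sandwich)

reg0 <- lm(rev ~ 1, data = data2_rev)  # intercept-only benchmark
waldtest(reg0, reg1, vcov = vcovHC(reg1, type = "HC1"))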
This is a fairly simple solution using coeftest:
library(lmtest)    # coeftest()
library(sandwich)  # vcovCL()

reg1 <- lm(rev ~ id + source + listed + country, data = data2_rev)

cl_robust <- coeftest(reg1, vcov = vcovCL, type = "HC1", cluster = ~country)
se_robust <- cl_robust[, 2]

stargazer(reg1, reg1, cl_robust, se = list(NULL, se_robust, NULL))
Note that I only included cl_robust in the output as a verification that the results are identical.

Ruby splitting a record into multiple records based on contents of a field

The record layout contains two fields:
Requisition
Test Names
Example record:
R00000001,"4 Calprotectin, 1 Luminex xTAG, 8 H. pylori stool antigen (IgA), 9 Lactoferrin, 3 Anti-gliadin IgA, 10 H. pylori Panel, 6 Fecal Fat, 11 Antibiotic Resistance Panel, 2 C. difficile Tox A/ Tox B, 5 Elastase, 7 Fecal Occult Blood, 12 Shigella"
The current Ruby code snippet used in the LIMS (Lab Information Management System) is this:
subj.get_value('Tests').join(', ')
What I need the Ruby code snippet to do is create a new record from each comma-separated value in the second field.
NOTE:
The number of values in the 'Test Names' field varies from 1 to 20...or more.
There can be hundreds of Requisition records.
Final result would be:
R00000001,"4 Calprotectin"
R00000001,"1 Luminex xTAG"
R00000001,"8 H. pylori stool antigen (IgA)"
R00000001,"9 Lactoferrin"
R00000001,"3 Anti-gliadin IgA"
R00000001,"10 H. pylori Panel"
R00000001,"6 Fecal Fat"
R00000001,"11 Antibiotic Resistance Panel"
R00000001,"2 C. difficile Tox A/ Tox B"
R00000001,"5 Elastase"
R00000001,"7 Fecal Occult Blood"
R00000001,"12 Shigella"
If your data is a reliably formatted string like the one in your example, here's a method that does it:
data = subj.get_value('Tests').join(', ') # assuming this gives your string

def split_data(data)
  arr = data.gsub('"', '').split(',')                # drop the quotes, split on commas
  arr[1..-1].map { |l| "#{arr[0]},\"#{l.strip}\"" }  # arr[0] is the requisition id
end
puts split_data(data)
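If the records actually live in a CSV file, a sketch using Ruby's standard CSV library (the file names are hypothetical) parses the quoted field for you and writes one line per test in the format shown above:
require 'csv'

File.open('out.csv', 'w') do |out|
  CSV.foreach('in.csv') do |req, tests|
    tests.split(',').each { |t| out << "#{req},\"#{t.strip}\"\n" }
  end
end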

setTargetReturn in fPortfolio package

I have a three-asset portfolio. I need to set the target return based on my second asset, but whenever I try, I get the error below:
asset.ts <- as.timeSeries(asset.ret)
spec <- portfolioSpec()
setSolver(spec) <- "solveRshortExact"
constraints <- c("Short")
setTargetReturn(Spec) = mean(colMeans(asset.ts[,2]))
efficientPortfolio(asset.ts, spec, constraints)
Error: is.numeric(targetReturn) is not TRUE
Title:
MV Efficient Portfolio
Estimator: covEstimator
Solver: solveRquadprog
Optimize: minRisk
Constraints: Short
Portfolio Weights:
MSFT AAPL NORD
0 0 0
Covariance Risk Budgets:
MSFT AAPL NORD
Target Return and Risks:
mean mu Cov Sigma CVaR VaR
0 0 0 0 0 0
Description:
Sat Apr 19 15:03:24 2014 by user: Usuario
I have tried and I have searched the web, but I have no idea how to set the target return to a specific expected return from the data set. I could copy the mean of my second asset in by hand, but I think rounding the decimals could affect the answer.
I ran into this error when using 2 assets.
It appears to be a bug in the PortOpt methods: when there are 2 assets, it runs .mvSolveTwoAssets, which looks for the targetReturn in the portfolio spec. But as you know, targetReturn isn't always needed.
However, in your code you have two separate variables for the spec: 'spec' and 'Spec'. Assuming 'Spec' is a typo, this line needs to be changed:
setTargetReturn(Spec) = mean(colMeans(asset.ts[,2]))
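A hedged sketch of the corrected lines (everything stays in the lower-case spec, so the target return ends up in the object that is actually passed to efficientPortfolio):
setTargetReturn(spec) <- mean(colMeans(asset.ts[, 2]))
efficientPortfolio(asset.ts, spec, constraints)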

How do you exclude a specific character pattern with regular expressions

I'm working on some regular expression matching and I'm trying to figure out how to exclude a specific character pattern. Specifically, I want to exclude the following pattern:
5 -   (in words: digit, space, dash & space)
I know how to exclude the components individually, [^5 ^-], but I'm looking to exclude the specific pattern. Is this possible?
Update - I'm using Ruby as my programming language.
Here is some sample input and the desired output:
Input: 1 - Blue-Stork Stables; 2 - Young, Robert, S.; 3 - Seahorse Stable; 4 - Carney, Elvis; 5 - Guerrero, Juan, Carlos-Martin; 6 - Dubb, Michael; 7 - Summers, Hope; 8 - DTH Stables; 9 - Peebles, Matthew\n
the desired output would be:
Output: Blue-Stork Stables; Young, Robert, S.; Seahorse Stable; Carney, Elvis; Guerrero, Juan, Carlos-Martin; Dubb, Michael; Summers, Hope; DTH Stables; Peebles, Matthew\n
Please take note of the dashes on Blue-Stork Stables and Juan Carlos-Martin.
EDIT: So you mean "remove", not "exclude". No problem:
result = subject.gsub(/\d+ - /, '')
transforms your input into the desired output. I've taken the liberty of allowing more than one digit (after all, if numbers reach 10 or higher, you probably want to remove those entirely too, right?).
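For instance, a quick check against the sample input from the question (a minimal sketch; input is just a local variable name):
input = "1 - Blue-Stork Stables; 2 - Young, Robert, S.; 3 - Seahorse Stable; " \
        "4 - Carney, Elvis; 5 - Guerrero, Juan, Carlos-Martin; 6 - Dubb, Michael; " \
        "7 - Summers, Hope; 8 - DTH Stables; 9 - Peebles, Matthew\n"

puts input.gsub(/\d+ - /, '')
# => Blue-Stork Stables; Young, Robert, S.; Seahorse Stable; Carney, Elvis;
#    Guerrero, Juan, Carlos-Martin; Dubb, Michael; Summers, Hope; DTH Stables; Peebles, Matthew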
(Old answer for "historical reasons")
Depending on what you mean by "exclude", it appears that you're looking for negative lookahead assertions:
^(?!.*\d - )
will fail on strings that contain a digit followed by " - " anywhere and succeed on all other strings:
"5 - " // fail
"5 -" // match
"abc5 - xyz" // fail
"foobar5 - " // fail
