SPSS syntax: new variable based on 3 variables

I need to create a new variable based on 3 variables.
If someone is coded as 1 on any one of the 3 variables, they are coded as 1 in the new variable.
If they are not coded 1 on any variable, but are coded 2 on any one of the 3 variables, they are coded as 2 in the new variable.
Everything else is coded as 99.
In syntax, I’ve written this as:
IF (Keep_Any=1 OR Find_Any=1 OR Improve_Any=1) Keep_Find_Improve=1.
IF ((Keep_Find_Improve~= 1) & (Keep_Any=2 | Find_Any=2 | Improve_Any=2)) Keep_Find_Improve=2.
IF (Keep_Find_Improve~=1 & Keep_Find_Improve~=2) Keep_Find_Improve=99.
EXECUTE.
However, while the first part correctly identifies cases coded as 1, the rest of the syntax doesn't work. This is despite using syntax that applies the exact same logic to other variables:
COMPUTE Keep_Any= Q9a_recoded = 1 | Q9b_recoded = 1 | Q9c_recoded = 1.
EXECUTE.
IF (Q9A_recoded=1 OR Q9B_recoded=1 OR Q9C_recoded=1) Keep_Any=1.
IF ((Keep_Any~=1) & (Q9A_recoded= 2 OR Q9B_recoded=2 OR Q9C_recoded=2)) Keep_Any=2.
IF (Keep_Any~=1 & Keep_Any~=2) Keep_Any=99.
EXECUTE.
Does anyone have any ideas of why this is happening and how to fix it?

Try:
COMPUTE Target1=99.
IF ANY(2, V1, V2, V3) Target1=2.
IF ANY(1, V1, V2, V3) Target1=1.
Or better still (more efficient, even if more lines of code)
DO IF ANY(1, V1, V2, V3)=1.
    COMPUTE Target2=1.
ELSE IF ANY(2, V1, V2, V3)=1.
    COMPUTE Target2=2.
ELSE.
    COMPUTE Target2=99.
END IF.
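For example, a case with V1=2, V2=1, V3=9 gets Target2=1, because the ANY(1, ...) branch is tested first; a case with no 1s but at least one 2 falls through to the second branch and gets Target2=2; anything else ends up as 99.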

The reason your syntax isn't working is the second condition you are using:
IF ((Keep_Find_Improve~= 1) & ...
When the first condition isn't met, Keep_Find_Improve is still (system-)missing, and a missing value can never satisfy the comparison ~=1: the comparison itself evaluates to missing, so the IF never fires.
The same thing happens later with IF (Keep_Any~=1 & Keep_Any~=2).
So you shouldn't be comparing Keep_Find_Improve to 1 or 2; you should be checking whether it has received a value or is still missing:
IF (Keep_Any=1 OR Find_Any=1 OR Improve_Any=1) Keep_Find_Improve=1.
IF (missing(Keep_Find_Improve) & (Keep_Any=2 | Find_Any=2 | Improve_Any=2)) Keep_Find_Improve=2.
IF missing(Keep_Find_Improve) Keep_Find_Improve=99.
* Alternatively: RECODE Keep_Find_Improve (MISSING=99).
EXECUTE.
All this being said, I recommend you use the more advanced code suggested by @JigneshSutar.

Related

Syntax to generate a Syntax in SPSS

I'm trying to construct a syntax to generate a syntax in SPSS, but I'm having some issues…
I have an Excel file with metadata and I would like to use it to build a syntax that extracts information from it (this way, if I have a huge database, I just need to keep the Excel file updated – add/delete variables, etc. – and then run a syntax that extracts the information needed for a new syntax).
I also noticed the produced syntax is always around 15 MB, which is a lot (applied to more than 500 lines)!
I don't use Python because the syntax has to run on different computers and/or configurations.
Any ideas? Can anyone please help me?
Thank you in advance.
Example:
(test.xlsx – sheet 1)
Var  Code  Label    List  Var_label (concatenate Var+Label)
V1   3     Sex      1     V1 "Sex"
V2   1     Work     2     V2 "Work"
V3   3     Country  3     V3 "Country"
V4   1     Married  2     V4 "Married"
V5   1     Kids     2     V5 "Kids"
V6   2     Satisf1  4     V6 "Satisf1"
V7   2     Satisf2  4     V7 "Satisf2"
(information from other file)
List = 1
1 "Male"
2 "Female"
List = 2
1 "Yes"
2 "No"
List = 3
1 "Europe"
2 "America"
3 "Asia"
4 "Africa"
5 "Oceania"
List = 4
1 "Very unsatisfied"
10 "Very satisfied"
I want to make a Syntax that generates a new syntax to apply “VARIABLE LABELS” and “VALUE LABELS”. So, I thought about something like this:
GET DATA
/TYPE=XLSX
/FILE="test.xlsx"
/SHEET=name 'sheet 1'
/CELLRANGE=FULL
/READNAMES=ON
/DATATYPEMIN PERCENTAGE=95.0.
EXECUTE.
STRING vlb (A15) labels (A150) value (A12) lab (A1500) point (A2) separate (A50) space (A2) list1 (A100) list2 (A100).
SELECT IF (Code=1).
COMPUTE vlb = "VARIABLE LABELS".
COMPUTE labels = CONCAT (RTRIM(Var_label)," ").
COMPUTE point = ".".
COMPUTE value = "VALUE LABELS".
COMPUTE lab = CONCAT (RTRIM(Var)," ").
COMPUTE list1 = '1 " Yes "'.
COMPUTE list2 = '2 "No".'.
COMPUTE space = " ".
COMPUTE separate="************************************************.".
WRITE OUTFILE = "list_01.sps" / vlb.
WRITE OUTFILE = "list_01.sps" /labels.
WRITE OUTFILE = "list_01.sps" /point.
WRITE OUTFILE = "list_01.sps" /value.
WRITE OUTFILE = "list_01.sps" /lab.
WRITE OUTFILE = "list_01.sps" /list1.
WRITE OUTFILE = "list_01.sps" /list2.
WRITE OUTFILE = "list_01.sps" /space.
WRITE OUTFILE = "list_01.sps" /separate.
WRITE OUTFILE = "list_01.sps" /space.
If there is only one variable with a given list (e.g. V1), it works OK. However, if more than one variable shares the same list, it reproduces the commands as many times as there are variables (e.g. V2, V4 and V5).
What I get (e.g. V2, V4 and V5) after running the code above:
VARIABLE LABELS
V2 "Work"
.
VALUE LABELS
V2
1 " Yes "
2 " No "
************************************************.
VARIABLE LABELS
V4 "Married"
.
VALUE LABELS
V4
1 " Yes "
2 " No "
************************************************.
VARIABLE LABELS
V5 "Kids"
.
VALUE LABELS
V5
1 " Yes "
2 " No "
************************************************.
What I would like to have:
VARIABLE LABELS
V2 "Work"
V4 "Married"
V5 "Kids"
.
VALUE LABELS
V2 V4 V5
1 " Yes "
2 " No "
I think there are probably ways to automate the whole process better, including the use of your second data source. But for the scope of this question I will suggest a way to get what you asked for specifically.
The key is to build the command with special conditions for first and last lines:
string cmd1 cmd2 (a200).
sort cases by code.
match files /file=* /first=first /last=last /by code. /* marking first and last lines.
do if first.
compute cmd1="VARIABLE LABELS".
compute cmd2="VALUE LABELS".
end if.
if not first cmd1=concat(rtrim(cmd1), " /"). /* "/" only appears from the second varname.
compute cmd1=concat(rtrim(cmd1), " ", Var_label).
compute cmd2=concat(rtrim(cmd2), " ", Var).
do if last.
compute cmd1=concat(rtrim(cmd1), " .").
compute cmd2=concat(rtrim(cmd2), " ", ' 1 " Yes " 2 "No". ').
end if.
exe.
The commands are now ready, but we don't want to get them mixed up so we'll stack them one under the other, and only then write them out:
add files /file=* /rename cmd1=cmd /file=* /rename cmd2=cmd.
exe.
WRITE OUTFILE = "var definitions.sps" / cmd .
exe.
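For the sample data above (after selecting Code=1, i.e. V2, V4 and V5), the generated var definitions.sps should look roughly like this (a sketch obtained by tracing the code, not verified output):
VARIABLE LABELS V2 "Work"
 / V4 "Married"
 / V5 "Kids" .
VALUE LABELS V2
 V4
 V5  1 " Yes " 2 "No".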
EDIT:
Note that the code above assumes you've already run a select cases if code = ... and that there is a single code in all the remaining lines.
Note also I added an exe. command at the end - without running that the new syntax will appear empty.

Get line number where first occurrence of a value appears?

I have a CSV file like below:
E Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Mean
1 0.7019 0.6734 0.6599 0.6511 0.701 0.6977 0.680833333
2 0.6421 0.6478 0.6095 0.608 0.6525 0.6285 0.6314
3 0.6039 0.6096 0.563 0.5539 0.6218 0.5716 0.5873
4 0.5564 0.5545 0.5138 0.4962 0.5781 0.5154 0.535733333
5 0.5056 0.4972 0.4704 0.4488 0.5245 0.4694 0.485983333
I'm trying to find the row number where the final column has a value below a certain threshold, for example below 0.6.
Using the above CSV file, I want to return 3, because E = 3 is the first row where Mean <= 0.60. If there is no value below 0.6, I want to return 0. In effect, I am returning the value in the first column based on the final column.
I plan to initialize this number as a constant in gnuplot. How can this be done? I've tagged awk because I think it's related.
In case you want a gnuplot-only version... if you use a file, remove the datablock and replace $Data with your filename in double quotes.
Edit: You can do it without a dummy table; it can be done more concisely with stats (check help stats). It is even shorter than the accepted solution (well, we are not at code golf here), and additionally platform-independent because it's gnuplot-only.
Furthermore, in case E could be any number, i.e. 0 as well, then it might be better to first assign E = NaN and then compare E to NaN (see here: gnuplot: How to compare to NaN?).
Script:
### conditional extraction into a variable
reset session
$Data <<EOD
E Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Mean
1 0.7019 0.6734 0.6599 0.6511 0.701 0.6977 0.680833333
2 0.6421 0.6478 0.6095 0.608 0.6525 0.6285 0.6314
3 0.6039 0.6096 0.563 0.5539 0.6218 0.5716 0.5873
4 0.5564 0.5545 0.5138 0.4962 0.5781 0.5154 0.535733333
5 0.5056 0.4972 0.4704 0.4488 0.5245 0.4694 0.485983333
EOD
E = NaN
stats $Data u ($8<=0.6 && E!=E? E=$1 : 0) nooutput
print E
### end of script
Result:
3.0
Actually, OP wants to return E=0 if the condition is not met. Then the script would look like this:
E=0
stats $Data u ($8<=0.6 && E==0? E=$1 : 0) nooutput
Another awk. You could initialize the default return value to the variable ret in a BEGIN block, but since it's 0 there is really no point, as an empty var+0 produces the same effect. If the threshold value of 0.6 is not met before the END is reached, that default of 0 is printed. If it is met, exit invokes the END block and ret is output:
$ awk '
NR>1 && $NF<0.6 {    # final column has a value below a certain range
    ret=$1           # I want to return 3 because E = 3
    exit
}
END {
    print ret+0
}' file
Output:
3
Something like this should do the trick:
awk 'NR>1 && $8<.6 {print $1;fnd=1;exit}END{if(!fnd){print 0}}' yourfile
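Since the number is meant to end up as a gnuplot constant, one way to glue the two together (a sketch, assuming a Unix-like environment where gnuplot can shell out to awk, and that the data is in yourfile) is gnuplot's system() function, whose string result is promoted to a number in arithmetic:
# in gnuplot: capture the awk result and force numeric conversion with +0
E = system("awk 'NR>1 && $8<.6 {print $1; fnd=1; exit} END {if (!fnd) print 0}' yourfile") + 0
print E    # 3 for the sample data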

Robust Standard Errors in lm() using stargazer()

I have read a lot about the pain of replicating Stata's easy robust option in R to get robust standard errors. I replicated the following approaches: StackExchange and Economic Theory Blog. They work, but the problem I face is that I want to print my results using the stargazer function (which prints the .tex code for LaTeX files).
Here is the illustration to my problem:
reg1 <-lm(rev~id + source + listed + country , data=data2_rev)
stargazer(reg1)
This prints the R output as .tex code (with non-robust SE). If I want to use robust SE, I can do it with the sandwich package as follows:
vcov <- vcovHC(reg1, "HC1")
If I now use stargazer(vcov), only the output of the vcovHC function is printed, and not the regression output itself.
With the lmtest package it is possible to print at least the estimates, but not the observations, R2, adjusted R2, residual standard error, and the F-statistic.
lmtest::coeftest(reg1, vcov. = sandwich::vcovHC(reg1, type = 'HC1'))
This gives the following output:
t test of coefficients:

             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.54923    6.85521  -0.3719 0.710611
id           0.39634    0.12376   3.2026 0.001722 **
source       1.48164    4.20183   0.3526 0.724960
country     -4.00398    4.00256  -1.0004 0.319041
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How can I get an output that also includes the following statistics?
Residual standard error: 17.43 on 127 degrees of freedom
Multiple R-squared: 0.09676, Adjusted R-squared: 0.07543
F-statistic: 4.535 on 3 and 127 DF, p-value: 0.00469
Did anybody face the same problem and can help me out?
How can I use robust standard errors in the lm function and apply the stargazer function?
You already calculated robust standard errors, and there's an easy way to include them in the stargazer output:
library("sandwich")
library("plm")
library("stargazer")
data("Produc", package = "plm")
# Regression (pooled OLS; note that plm's argument is `model`, not `method`)
model <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
             data = Produc,
             index = c("state", "year"),
             model = "pooling")
# Adjust standard errors
cov1 <- vcovHC(model, type = "HC1")
robust_se <- sqrt(diag(cov1))
# Stargazer output (with and without RSE)
stargazer(model, model, type = "text",
          se = list(NULL, robust_se))
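The list passed to se lines up with the models handed to stargazer: NULL keeps the default standard errors for the first column, while robust_se replaces them in the second, so the two versions appear side by side.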
Solution found here: https://www.jakeruss.com/cheatsheets/stargazer/#robust-standard-errors-replicating-statas-robust-option
Update: I'm not so much into F-tests. People are discussing those issues, e.g. https://stats.stackexchange.com/questions/93787/f-test-formula-under-robust-standard-error
Following http://www3.grips.ac.jp/~yamanota/Lecture_Note_9_Heteroskedasticity:
"A heteroskedasticity-robust t statistic can be obtained by dividing an OLS estimator by its robust standard error (for zero null hypotheses). The usual F-statistic, however, is invalid. Instead, we need to use the heteroskedasticity-robust Wald statistic."
So one would use a Wald statistic here.
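A minimal sketch of such a robust Wald test, assuming reg1 from the question (lmtest::waldtest accepts either a covariance function or, when comparing two models, the covariance matrix of the larger model via its vcov argument):
library(lmtest)
library(sandwich)
# Robust Wald test of the joint null that all slope coefficients are zero,
# in place of the usual (invalid) F-statistic; . ~ 1 is the intercept-only model
waldtest(reg1, . ~ 1, vcov = vcovHC(reg1, type = "HC1"), test = "F")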
This is a fairly simple solution using coeftest (from lmtest, with vcovCL from sandwich):
library(lmtest)
library(sandwich)
reg1 <- lm(rev ~ id + source + listed + country, data = data2_rev)
cl_robust <- coeftest(reg1, vcov = vcovCL, type = "HC1", cluster = ~country)
se_robust <- cl_robust[, 2]
stargazer(reg1, reg1, cl_robust, se = list(NULL, se_robust, NULL))
Note that I included cl_robust in the output only as a verification that the results are identical.

Error in Parsing the postscript to pdf

I have a PostScript file. When I try to convert it to PDF, or open it with a PostScript viewer, it gives the following error:
undefined in execform
I am trying to fix this error, but I haven't found a solution. Kindly help me understand the issue.
This is the PostScript file.
OK, so a few observations to start:
The file is 8 pages long, uses many Forms, and the first Form it uses has nested Forms. This really isn't suitable as an example file; you are expecting other programmers to dig through a lot of extraneous cruft to help you out. When you post an example, please try and reduce it to just the minimum required to reproduce the problem.
Have you actually tried to debug this problem yourself? If so, what did you do? (And why didn't you start by reducing the file's complexity?)
I don't want to be offensive, but this is the third rather naive posting you've made recently. Do you have much experience of PostScript programming? Has anyone offered you any training in the language? It appears you are working on behalf of a commercial organisation; you should talk to your line manager and try and arrange some training if you haven't already been given some.
The PostScript program does not give the error you stated
undefined in execform
In fact the error is a Ghostscript-specific error message:
Error: /undefined in --.execform1--
So that's the .execform1 operator (note the leading '.' indicating a Ghostscript-internal operator). That's only important because, firstly, it's important to accurately quote error messages, and secondly because, for someone familiar with Ghostscript, it tells you that the error occurs while executing the Form's PaintProc, not while executing the execform operator.
After considerably reducing the complexity of the file, it turns out the problem has absolutely nothing to do with the use of Forms. The offending Form executes code like this:
2 RM
0.459396 w
[(\0\1\0\2)]435.529999 -791.02002 T
(That's the first occurrence, and it's where the error occurs.)
That executes the procedure named T which is defined as:
/T{neg _LY add /_y ed _LX add /_x ed/_BLSY _y _BLY sub D/_BLX _x D/_BLY _y D _x _y TT}bd
Obviously that's using a number of other functions defined in the prolog, but the important point is that it executes TT, which is defined as:
/TT{/_y ed/_x ed/_SX _x _LX sub D/_SY _y _LY sub D/_LX _x D/_LY _y D _x _y m 0 _rm eq{ dup type/stringtype eq{show}{{ dup type /stringtype eq{show}{ 0 rmoveto}?}forall}?} if
1 _rm eq {gsave 0 _scs eq { _sr setgray}if 1 _scs eq { _sr _sg _sb setrgbcolor}if 2 _scs eq { _sr _sg _sb _sk setcmykcolor} if dup type/stringtype eq{true charpath }{{dup type /stringtype eq{true charpath } { 0 rmoveto}?}forall}? S grestore} if
2 _rm eq {gsave 0 _fcs eq { _fr setgray}if 1 _fcs eq { _fr _fg _fb setrgbcolor}if 2 _fcs eq { _fr _fg _fb _fk setcmykcolor} if dup type/stringtype eq{true charpath }{{dup type /stringtype eq{true charpath } { 0 rmoveto}?}
forall}? gsave fill grestore 0 _scs eq { _sr setgray}if 1 _scs eq { _sr _sg _sb setrgbcolor}if 2 _scs eq { _sr _sg _sb _sk setcmykcolor}if S grestore} if
Under the conditions holding at the time TT is executed (RM sets _rm to 2), we go through this piece of code:
gsave 0 _fcs eq
However, _fcs is initially undefined, and only defined when the /fcs function is executed. Your program never executes /fcs, so _fcs is undefined, leading to the error.
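If you control the generator, the proper fix is to emit the /fcs call (or avoid relying on _fcs) before TT runs. As a hypothetical workaround for this particular file, you could give the names a default in the prolog, for example:
% Hypothetical workaround (assumes a gray fill default is acceptable):
/_fcs 0 def   % 0 selects the '_fr setgray' branch in TT
/_fr 0 def    % gray level used by that branch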
Is there a reason why you are defining each page in a PostScript Form? This is not optimal; if the interpreter actually supports Forms then you are using up VM for no useful purpose (since you only execute each Form once).
If it's because the original PDF input uses PDF Form XObjects, I would recommend that you don't try to reproduce those in PostScript. Reuse of Form XObjects in PDF is rather rare (it does happen, but non-reuse is much more common). The loss of efficiency from describing a PostScript Form for every PDF Form XObject, in all the files where the form isn't reused, exceeds the benefit in the rare cases where it would actually be valuable.

Error in setting max features parameter in Isolation Forest algorithm using sklearn

I'm trying to train a dataset with 357 features using the Isolation Forest implementation in sklearn. I can successfully train and get results when the max features variable is set to 1.0 (the default value).
However when max features is set to 2, it gives the following error:
ValueError: Number of features of the model must match the input.
Model n_features is 2 and input n_features is 357
It also gives the same error when the feature count is 1 (int) and not 1.0 (float).
My understanding was that when the feature count is set to 2 (int), two features should be considered when creating each tree. Is this wrong? How can I change the max features parameter?
The code is as follows:
from sklearn.ensemble import IsolationForest

def isolation_forest_imp(dataset):
    estimators = 10
    samples = 100
    features = 2
    contamination = 0.1
    bootstrap = False
    random_state = None
    verbosity = 0
    estimator = IsolationForest(n_estimators=estimators, max_samples=samples,
                                contamination=contamination,
                                max_features=features,
                                bootstrap=bootstrap, random_state=random_state,
                                verbose=verbosity)
    model = estimator.fit(dataset)
In the documentation it states:
max_features : int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
- If int, then draw `max_features` features.
- If float, then draw `max_features * X.shape[1]` features.
So, 2 should mean take two features and 1.0 should mean take all of the features, 0.5 take half and so on, from what I understand.
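For example, with this dataset's 357 features, max_features=0.5 would draw int(0.5 * 357) = 178 features per base estimator, while max_features=2 should draw exactly two.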
I think this could be a bug. Taking a look at IsolationForest's fit:
# Isolation Forest inherits from BaseBagging;
# when _fit is called, BaseBagging takes care of the features correctly
super(IsolationForest, self)._fit(X, y, max_samples,
                                  max_depth=max_depth,
                                  sample_weight=sample_weight)
# however, after _fit the decision_function is called using X - the whole
# sample - not taking max_features into account
self.threshold_ = -sp.stats.scoreatpercentile(
    -self.decision_function(X), 100. * (1. - self.contamination))
then:
# when the decision function's _validate_X_predict is called, with X unmodified,
# it calls the base estimator's (dt) _validate_X_predict with the whole X
X = self.estimators_[0]._validate_X_predict(X, check_input=True)
...
# from tree.py:
def _validate_X_predict(self, X, check_input):
    """Validate X whenever one tries to predict, apply, predict_proba"""
    if self.tree_ is None:
        raise NotFittedError("Estimator not fitted, "
                             "call `fit` before exploiting the model.")
    if check_input:
        X = check_array(X, dtype=DTYPE, accept_sparse="csr")
        if issparse(X) and (X.indices.dtype != np.intc or
                            X.indptr.dtype != np.intc):
            raise ValueError("No support for np.int64 index based "
                             "sparse matrices")
    # so, this check fails because X is the original X, not with max_features applied
    n_features = X.shape[1]
    if self.n_features_ != n_features:
        raise ValueError("Number of features of the model must "
                         "match the input. Model n_features is %s and "
                         "input n_features is %s "
                         % (self.n_features_, n_features))
    return X
So I am not sure how you can handle this. Maybe figure out the float fraction that leads to just the two features you need, even though I am not sure it will work as expected.
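For what it's worth, a minimal sketch of that workaround (untested, per the caveat above; n_wanted and fraction are names introduced here for illustration):
# Untested sketch: express the desired feature count as a fraction of the
# total number of features, so the float branch of max_features is used
n_wanted = 2
# pad slightly so int(fraction * n_features) doesn't floor to 1 through
# floating-point rounding
fraction = (n_wanted + 0.5) / dataset.shape[1]
estimator = IsolationForest(n_estimators=10, max_samples=100,
                            contamination=0.1, max_features=fraction)
model = estimator.fit(dataset)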
Note: I am using scikit-learn v.0.18
Edit: as @Vivek Kumar commented, this is a known issue, and upgrading to 0.20 should do the trick.
