I’m trying to construct a Syntax to generate a Syntax in SPSS, but I’m having some issues…
I have an excel file with metadata and I would like to use it in order to make a syntax to extract information from it (like this, if I have a huge database, I just need to keep the excel updated – add/delete variables, etc. - and then run a syntax to extract the needed information for a new syntax).
I also noticed the produced syntax has always around 15Mb, which is a lot (applied to more than 500 lines)!
I don’t use Python due to run syntax in different computers and/or configurations.
Any ideas? Can anyone please help me?
Thank you in advance.
Example:
(test.xlsx – sheet 1)
Var Code Label List Var_label (concatenate Var+Label)
V1 3 Sex 1 V1 “Sex”
V2 1 Work 2 V2 “Work”
V3 3 Country 3 V3 “Country”
V4 1 Married 2 V4 “Married”
V5 1 Kids 2 V5 “Kids”
V6 2 Satisf1 4 V6 “Satisf1”
V7 2 Satisf2 4 V7 “Satisf2”
(information from other file)
List = 1
1 “Male”
2 “Female”
List = 2
1 “Yes”
2 “No”
List = 3
1 “Europe”
2 “America”
3 “Asia”
4 “Africa”
5 “Oceania”
List = 4
1 “Very unsatisfied”
10 “Very satisfied”
I want to make a Syntax that generates a new syntax to apply “VARIABLE LABELS” and “VALUE LABELS”. So, I thought about something like this:
GET DATA
/TYPE=XLSX
/FILE="test.xlsx"
/SHEET=name 'sheet 1'
/CELLRANGE=FULL
/READNAMES=ON
/DATATYPEMIN PERCENTAGE=95.0.
EXECUTE.
STRING vlb (A15) labels (A150) value (A12) lab (A1500) point (A2) separate (A50) space (A2) list1 (A100) list2 (A100).
SELECT IF (Code=1).
COMPUTE vlb = "VARIABLE LABELS".
COMPUTE labels = CONCAT (RTRIM(Var_label)," ").
COMPUTE point = ".".
COMPUTE value = "VALUE LABELS".
COMPUTE lab = CONCAT (RTRIM(Var)," ").
COMPUTE list1 = '1 " Yes "'.
COMPUTE list2 = '2 "No".'.
COMPUTE space = " ".
COMPUTE separate="************************************************.".
WRITE OUTFILE = "list_01.sps" / vlb.
WRITE OUTFILE = "list_01.sps" /labels.
WRITE OUTFILE = "list_01.sps" /point.
WRITE OUTFILE = "list_01.sps" /value.
WRITE OUTFILE = "list_01.sps" /lab.
WRITE OUTFILE = "list_01.sps" /list1.
WRITE OUTFILE = "list_01.sps" /list2.
WRITE OUTFILE = "list_01.sps" /space.
WRITE OUTFILE = "list_01.sps" /separate.
WRITE OUTFILE = "list_01.sps" /space.
If there is only one variable with same list (ex: V1), it works ok. However, if there is more than one variable having the same list, it reproduces the codes as much times as number of variables (Ex: V2, V4 and V5).
What I have (Ex: V2, V4 and V5), after running code above:
VARIABLE LABELS
V2 "Work"
.
VALUE LABELS
V2
1 " Yes "
2 " No "
************************************************.
VARIABLE LABELS
V4 "Married"
.
VALUE LABELS
V4
1 " Yes "
2 " No "
************************************************.
VARIABLE LABELS
V5 "Kids"
.
VALUE LABELS
V5
1 " Yes "
2 " No "
************************************************.
What I would like to have:
VARIABLE LABELS
V2 "Work"
V4 "Married"
V5 "Kids"
.
VALUE LABELS
V2 V4 V5
1 " Yes "
2 " No "
I think there are probably ways to automate the whole process better, including the use of your second data source. But for the scope of this question I will suggest a way to get what you asked for specifically.
The key is to build the command with special conditions for first and last lines:
string cmd1 cmd2 (a200).
sort cases by code.
match files /file=* /first=first /last=last /by code. /* marking first and last lines.
do if first.
compute cmd1="VARIABLE LABELS".
compute cmd2="VALUE LABELS".
end if.
if not first cmd1=concat(rtrim(cmd1), " /"). /* "/" only appears from the second varname.
compute cmd1=concat(rtrim(cmd1), " ", Var_label).
compute cmd2=concat(rtrim(cmd2), " ", Var).
do if last.
compute cmd1=concat(rtrim(cmd1), " .").
compute cmd2=concat(rtrim(cmd2), " ", ' 1 " Yes " 2 "No". ').
end if.
exe.
The commands are now ready, but we don't want to get them mixed up so we'll stack them one under the other, and only then write them out:
add files /file=* /rename cmd1=cmd /file=* /rename cmd2=cmd.
exe.
WRITE OUTFILE = "var definitions.sps" / cmd .
exe.
EDIT:
Note that the code above assumes you've already run a select cases if code = ... and that there is a single code in all the remaining lines.
Note also I added an exe. command at the end - without running that the new syntax will appear empty.
I basically want to know the best to convert my BigDecimal numbers to readable format (float perhaps) for the purposes of displaying them to the client:
In order to figure out liabilities, I do contributions - distributions from each address.
For example, if a person contributes 2 units to an address and that same address distributes 1 unit back to the person, then there is still a 1 unit liability. that's what's going on below. These numbers are all units of cryptocurrency.
Here's another example:
Say address1 and address2 contribute 2 coins each to address3, and address3 distributes 1.0 coins to address1 and 0.5 coins to address2, then address3 has a 1.0 coin liability to address1 and a 1.5 coin liability to address2.
So the actual data using BigDecimal below:
contributions =
{"1444"=>#<BigDecimal:7f915c08f030,'0.502569E2',18(36)>,
"alice"=>#<BigDecimal:7f915c084018,'0.211E1',18(27)>,
"address1"=>#<BigDecimal:7f915c0a4430,'0.87161E1',18(36)>,
"address2"=>#<BigDecimal:7f915c0943f0,'0.84811E1',18(36)>,
"address3"=>#<BigDecimal:7f915c0a43e0,'0.385E0',9(18)>,
"address6"=>#<BigDecimal:7f915c09ebe8,'0.1E1',9(18)>,
"address7"=>#<BigDecimal:7f915c09eb98,'0.1E1',9(18)>,
"address8"=>#<BigDecimal:7f915c09d428,'0.15E1',18(18)>,
"address9"=>#<BigDecimal:7f915c09d3d8,'0.15E1',18(18)>,
"address10"=>#<BigDecimal:7f915c0a7540,'0.132E1',18(36)>,
"address11"=>#<BigDecimal:7f915c0af8a8,'0.392E1',18(36)>,
"address12"=>#<BigDecimal:7f915c0a4980,'0.14E1',18(36)>,
"address13"=>#<BigDecimal:7f915c0af858,'0.2133333333 3333333333 33333334E1',36(54)>,
"address14"=>#<BigDecimal:7f915c0a54c0,'0.3533333333 3333333333 33333334E1',36(45)>,
"address15"=>#<BigDecimal:7f915c0a66e0,'0.1533333333 3333333333 33333334E1',36(36)>,
"sdfds"=>#<BigDecimal:7f915c0a6118,'0.1E0',9(27)>,
"sf"=>#<BigDecimal:7f915c0a6028,'0.1E0',9(27)>,
"address20"=>#<BigDecimal:7f915c0ae688,'0.3E0',9(18)>,
"address21"=>#<BigDecimal:7f915c0ae638,'0.3E0',9(18)>,
"address23"=>#<BigDecimal:7f915c0ae070,'0.1E0',9(27)>,
"address22"=>#<BigDecimal:7f915c0adf80,'0.1E0',9(27)>,
"add1"=>#<BigDecimal:7f915c0ad328,'0.1E0',9(18)>,
"add2"=>#<BigDecimal:7f915c0ad2d8,'0.1E0',9(18)>,
"addx"=>#<BigDecimal:7f915c0acd10,'0.1E0',9(27)>,
"addy"=>#<BigDecimal:7f915c0acc20,'0.1E0',9(27)>}
and the distributions:
distributions =
{"1444"=>#<BigDecimal:7f915a9068f0,'0.502569E2',18(63)>,
"alice"=>#<BigDecimal:7f915a8f44e8,'0.211E1',18(27)>,
"address1"=>#<BigDecimal:7f915a906800,'0.87161E1',18(54)>,
"address2"=>#<BigDecimal:7f915a906710,'0.84811E1',18(54)>,
"address3"=>#<BigDecimal:7f915a906620,'0.385E0',9(36)>,
"address6"=>#<BigDecimal:7f915a8fdea8,'0.1E1',9(27)>,
"address7"=>#<BigDecimal:7f915a8fddb8,'0.1E1',9(27)>,
"address8"=>#<BigDecimal:7f915a8fd5e8,'0.15E1',18(18)>,
"address9"=>#<BigDecimal:7f915a8fd4f8,'0.15E1',18(18)>,
"address10"=>#<BigDecimal:7f915a8fc9b8,'0.132E1',18(36)>,
"address11"=>#<BigDecimal:7f915a9071b0,'0.3920000000 0000003E1',27(45)>,
"address12"=>#<BigDecimal:7f915a907660,'0.1400000000 0000001E1',27(36)>,
"address13"=>#<BigDecimal:7f915a9070c0,'0.2133333333 3333337E1',27(45)>,
"address14"=>#<BigDecimal:7f915a906530,'0.3533333333 3333333333 33333334E1',36(54)>,
"address15"=>#<BigDecimal:7f915a8fc148,'0.1533333333 3333334E1',27(27)>,
"sdfds"=>#<BigDecimal:7f915a907f98,'0.1E0',9(27)>,
"sf"=>#<BigDecimal:7f915a907e08,'0.1E0',9(27)>,
"address20"=>#<BigDecimal:7f915a906ad0,'0.3000000000 0000003E0',18(27)>,
"address21"=>#<BigDecimal:7f915a9069e0,'0.3000000000 0000003E0',18(27)>,
"address23"=>#<BigDecimal:7f915a9063c8,'0.1E0',9(27)>,
"address22"=>#<BigDecimal:7f915a906238,'0.1E0',9(27)>,
"add1"=>#<BigDecimal:7f915a9060a8,'0.5E-1',9(27)>,
"add2"=>#<BigDecimal:7f915a905f18,'0.1E0',9(27)>}
Ideally, I want my liabilities to look like this:
{"add1"=>0.05,
"addx"=>0.1>,
"addy"=>0.1>}
But they look like this:
{"1444"=>0.0,
"alice"=>0.0,
"address1"=>0.0,
"address2"=>0.0,
"address3"=>0.0,
"address6"=>0.0,
"address7"=>0.0,
"address8"=>0.0,
"address9"=>0.0,
"address10"=>0.0,
"address11"=>-3.0e-16,
"address12"=>-1.0e-16,
"address13"=>-3.66666666666e-16,
"address14"=>0.0,
"address15"=>-6.6666666666e-17,
"sdfds"=>0.0,
"sf"=>0.0,
"address20"=>-3.0e-17,
"address21"=>-3.0e-17,
"address23"=>0.0,
"address22"=>0.0,
"add1"=>0.05,
"add2"=>0.0,
"addx"=>#<BigDecimal:7f915c0acd10,'0.1E0',9(27)>,
"addy"=>#<BigDecimal:7f915c0acc20,'0.1E0',9(27)>}
I don't want to include -3.66666666666e-16 because that's essentially 0 and even Ruby accounts for it this way when you run -3.66666666666e-16 > 0... it returns false.
This is what I have... is the a better way? The code below is calculating the liabilities by subtracting con from dis and only selecting the liabilities that are greater than 0.0...that makes sense to me and it excludes 1-time grants of coins (there must be a matching contribution to be a liability). Then, I convert everything to floats so it's readable. Does this look right?
liab = #contributions.merge(#distributions) do |key, con, dis|
(con - dis)
end.select { |addr, amount| amount > 0.0 && #contributions.keys.include?(addr) }
liab.merge(liab) do |k, old, new|
k = new.to_f
end
I want the amount returned in float format, not the big decimal object. Is what I'm doing okay? Will I keep accuracy if I convert to float at the end?
I'm trying to train a dataset with 357 features using Isolation Forest sklearn implementation. I can successfully train and get results when the max features variable is set to 1.0 (the default value).
However when max features is set to 2, it gives the following error:
ValueError: Number of features of the model must match the input.
Model n_features is 2 and input n_features is 357
It also gives the same error when the feature count is 1 (int) and not 1.0 (float).
How I understood was that when the feature count is 2 (int), two features should be considered in creating each tree. Is this wrong? How can I change the max features parameter?
The code is as follows:
from sklearn.ensemble.iforest import IsolationForest
def isolation_forest_imp(dataset):
estimators = 10
samples = 100
features = 2
contamination = 0.1
bootstrap = False
random_state = None
verbosity = 0
estimator = IsolationForest(n_estimators=estimators, max_samples=samples, contamination=contamination,
max_features=features,
bootstrap=boostrap, random_state=random_state, verbose=verbosity)
model = estimator.fit(dataset)
In the documentation it states:
max_features : int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
- If int, then draw `max_features` features.
- If float, then draw `max_features * X.shape[1]` features.
So, 2 should mean take two features and 1.0 should mean take all of the features, 0.5 take half and so on, from what I understand.
I think this could be a bug, since, taking a look in IsolationForest's fit:
# Isolation Forest inherits from BaseBagging
# and when _fit is called, BaseBagging takes care of the features correctly
super(IsolationForest, self)._fit(X, y, max_samples,
max_depth=max_depth,
sample_weight=sample_weight)
# however, when after _fit the decision_function is called using X - the whole sample - not taking into account the max_features
self.threshold_ = -sp.stats.scoreatpercentile(
-self.decision_function(X), 100. * (1. - self.contamination))
then:
# when the decision function _validate_X_predict is called, with X unmodified,
# it calls the base estimator's (dt) _validate_X_predict with the whole X
X = self.estimators_[0]._validate_X_predict(X, check_input=True)
...
# from tree.py:
def _validate_X_predict(self, X, check_input):
"""Validate X whenever one tries to predict, apply, predict_proba"""
if self.tree_ is None:
raise NotFittedError("Estimator not fitted, "
"call `fit` before exploiting the model.")
if check_input:
X = check_array(X, dtype=DTYPE, accept_sparse="csr")
if issparse(X) and (X.indices.dtype != np.intc or
X.indptr.dtype != np.intc):
raise ValueError("No support for np.int64 index based "
"sparse matrices")
# so, this check fails because X is the original X, not with the max_features applied
n_features = X.shape[1]
if self.n_features_ != n_features:
raise ValueError("Number of features of the model must "
"match the input. Model n_features is %s and "
"input n_features is %s "
% (self.n_features_, n_features))
return X
So, I am not sure on how you can handle this. Maybe figure out the percentage that leads to just the two features you need - even though I am not sure it'll work as expected.
Note: I am using scikit-learn v.0.18
Edit: as #Vivek Kumar commented this is an issue and upgrading to 0.20 should do the trick.
I'm working with some log files, trying to extract pieces of data.
Here's an example of a file which, for the purposes of testing, I'm loading into a variable named sample. NOTE: The column layout of the log files is not guaranteed to be consistent from one file to the next.
sample = "test script result
Load for five secs: 70%/50%; one minute: 53%; five minutes: 49%
Time source is NTP, 23:25:12.829 UTC Wed Jun 11 2014
D
MAC Address IP Address MAC RxPwr Timing I
State (dBmv) Offset P
0000.955c.5a50 192.168.0.1 online(pt) 0.00 5522 N
338c.4f90.2794 10.10.0.1 online(pt) 0.00 3661 N
990a.cb24.71dc 127.0.0.1 online(pt) -0.50 4645 N
778c.4fc8.7307 192.168.1.1 online(pt) 0.00 3960 N
"
Right now, I'm just looking for IPv4 and MAC address; eventually the search will need to include more patterns. To accomplish this, I'm using two regular expressions and passing them to Regexp.union
patterns = Regexp.union(/(?<mac_address>\h{4}\.\h{4}\.\h{4})/, /(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/)
As you can see, I'm using named groups to identify the matches.
The result I'm trying to achieve is a Hash. The key should equal the capture group name, and the value should equal what was matched by the regular expression.
Example:
{"mac_address"=>"0000.955c.5a50", "ip_address"=>"192.168.0.1"}
{"mac_address"=>"338c.4f90.2794", "ip_address"=>"10.10.0.1"}
{"mac_address"=>"990a.cb24.71dc", "ip_address"=>"127.0.0.1"}
{"mac_address"=>"778c.4fc8.7307", "ip_address"=>"192.168.1.1"}
Here's what I've come up with so far:
sample.split(/\r?\n/).each do |line|
hashes = []
line.split(/\s+/).each do |val|
match = val.match(patterns)
if match
hashes << Hash[match.names.zip(match.captures)].delete_if { |k,v| v.nil? }
end
end
results = hashes.reduce({}) { |r,h| h.each {|k,v| r[k] = v}; r }
puts results if results.length > 0
end
I feel like there should be a more "elegant" way to do this. My chief concern, though, is performance.