Ruby splitting a record into multiple records based on contents of a field - ruby

Record layout contains two fields:
Requistion
Test Names
Example record:
R00000001,"4 Calprotectin, 1 Luminex xTAG, 8 H. pylori stool antigen (IgA), 9 Lactoferrin, 3 Anti-gliadin IgA, 10 H. pylori Panel, 6 Fecal Fat, 11 Antibiotic Resistance Panel, 2 C. difficile Tox A/ Tox B, 5 Elastase, 7 Fecal Occult Blood, 12 Shigella"
The current Ruby code snippet that is used in the LIMS (Lab Info Management System) system is this:
subj.get_value('Tests').join(', ')
What I need to be able to do in the Ruby code snippet is create a new record off each comma-separated value in the second field.
NOTE:
the amount of values in the 'Test Names' field varies from 1 to 20...or more.
There can be 100's of Requistion records
Final result would be:
R00000001,"4 Calprotectin"
R00000001,"1 Luminex xTAG"
R00000001,"8 H. pylori stool antigen (IgA)"
R00000001,"9 Lactoferrin"
R00000001,"3 Anti-gliadin IgA"
R00000001,"10 H. pylori Panel"
R00000001,"6 Fecal Fat"
R00000001,"11 Antibiotic Resistance Panel"
R00000001,"2 C. difficile Tox A/ Tox B"
R00000001,"5 Elastase"
R00000001,"7 Fecal Occult Blood"
R00000001,"12 Shigella"

If your data is a reliable string which you've shown in your example, here's your method:
data = subj.get_value('Tests').join(', ') # assuming this gives your string obj.
def split_data(data)
arr = data.gsub('"','').split(',')
arr.map {|l| "#{arr[0]} \"#{l.strip}\""}[1..-1]
end
puts split_data(data)

Related

Syntax to generate a Syntax in SPSS

I’m trying to construct a Syntax to generate a Syntax in SPSS, but I’m having some issues…
I have an excel file with metadata and I would like to use it in order to make a syntax to extract information from it (like this, if I have a huge database, I just need to keep the excel updated – add/delete variables, etc. - and then run a syntax to extract the needed information for a new syntax).
I also noticed the produced syntax has always around 15Mb, which is a lot (applied to more than 500 lines)!
I don’t use Python due to run syntax in different computers and/or configurations.
Any ideas? Can anyone please help me?
Thank you in advance.
Example:
(test.xlsx – sheet 1)
Var Code Label List Var_label (concatenate Var+Label)
V1 3 Sex 1 V1 “Sex”
V2 1 Work 2 V2 “Work”
V3 3 Country 3 V3 “Country”
V4 1 Married 2 V4 “Married”
V5 1 Kids 2 V5 “Kids”
V6 2 Satisf1 4 V6 “Satisf1”
V7 2 Satisf2 4 V7 “Satisf2”
(information from other file)
List = 1
1 “Male”
2 “Female”
List = 2
1 “Yes”
2 “No”
List = 3
1 “Europe”
2 “America”
3 “Asia”
4 “Africa”
5 “Oceania”
List = 4
1 “Very unsatisfied”
10 “Very satisfied”
I want to make a Syntax that generates a new syntax to apply “VARIABLE LABELS” and “VALUE LABELS”. So, I thought about something like this:
GET DATA
/TYPE=XLSX
/FILE="test.xlsx"
/SHEET=name 'sheet 1'
/CELLRANGE=FULL
/READNAMES=ON
/DATATYPEMIN PERCENTAGE=95.0.
EXECUTE.
STRING vlb (A15) labels (A150) value (A12) lab (A1500) point (A2) separate (A50) space (A2) list1 (A100) list2 (A100).
SELECT IF (Code=1).
COMPUTE vlb = "VARIABLE LABELS".
COMPUTE labels = CONCAT (RTRIM(Var_label)," ").
COMPUTE point = ".".
COMPUTE value = "VALUE LABELS".
COMPUTE lab = CONCAT (RTRIM(Var)," ").
COMPUTE list1 = '1 " Yes "'.
COMPUTE list2 = '2 "No".'.
COMPUTE space = " ".
COMPUTE separate="************************************************.".
WRITE OUTFILE = "list_01.sps" / vlb.
WRITE OUTFILE = "list_01.sps" /labels.
WRITE OUTFILE = "list_01.sps" /point.
WRITE OUTFILE = "list_01.sps" /value.
WRITE OUTFILE = "list_01.sps" /lab.
WRITE OUTFILE = "list_01.sps" /list1.
WRITE OUTFILE = "list_01.sps" /list2.
WRITE OUTFILE = "list_01.sps" /space.
WRITE OUTFILE = "list_01.sps" /separate.
WRITE OUTFILE = "list_01.sps" /space.
If there is only one variable with same list (ex: V1), it works ok. However, if there is more than one variable having the same list, it reproduces the codes as much times as number of variables (Ex: V2, V4 and V5).
What I have (Ex: V2, V4 and V5), after running code above:
VARIABLE LABELS
V2 "Work"
.
VALUE LABELS
V2
1 " Yes "
2 " No "
************************************************.
VARIABLE LABELS
V4 "Married"
.
VALUE LABELS
V4
1 " Yes "
2 " No "
************************************************.
VARIABLE LABELS
V5 "Kids"
.
VALUE LABELS
V5
1 " Yes "
2 " No "
************************************************.
What I would like to have:
VARIABLE LABELS
V2 "Work"
V4 "Married"
V5 "Kids"
.
VALUE LABELS
V2 V4 V5
1 " Yes "
2 " No "
I think there are probably ways to automate the whole process better, including the use of your second data source. But for the scope of this question I will suggest a way to get what you asked for specifically.
The key is to build the command with special conditions for first and last lines:
string cmd1 cmd2 (a200).
sort cases by code.
match files /file=* /first=first /last=last /by code. /* marking first and last lines.
do if first.
compute cmd1="VARIABLE LABELS".
compute cmd2="VALUE LABELS".
end if.
if not first cmd1=concat(rtrim(cmd1), " /"). /* "/" only appears from the second varname.
compute cmd1=concat(rtrim(cmd1), " ", Var_label).
compute cmd2=concat(rtrim(cmd2), " ", Var).
do if last.
compute cmd1=concat(rtrim(cmd1), " .").
compute cmd2=concat(rtrim(cmd2), " ", ' 1 " Yes " 2 "No". ').
end if.
exe.
The commands are now ready, but we don't want to get them mixed up so we'll stack them one under the other, and only then write them out:
add files /file=* /rename cmd1=cmd /file=* /rename cmd2=cmd.
exe.
WRITE OUTFILE = "var definitions.sps" / cmd .
exe.
EDIT:
Note that the code above assumes you've already run a select cases if code = ... and that there is a single code in all the remaining lines.
Note also I added an exe. command at the end - without running that the new syntax will appear empty.

Doing BigDecimal accounting with Ruby. When can I convert to float?

I basically want to know the best to convert my BigDecimal numbers to readable format (float perhaps) for the purposes of displaying them to the client:
In order to figure out liabilities, I do contributions - distributions from each address.
For example, if a person contributes 2 units to an address and that same address distributes 1 unit back to the person, then there is still a 1 unit liability. that's what's going on below. These numbers are all units of cryptocurrency.
Here's another example:
Say address1 and address2 contribute 2 coins each to address3, and address3 distributes 1.0 coins to address1 and 0.5 coins to address2, then address3 has a 1.0 coin liability to address1 and a 1.5 coin liability to address2.
So the actual data using BigDecimal below:
contributions =
{"1444"=>#<BigDecimal:7f915c08f030,'0.502569E2',18(36)>,
"alice"=>#<BigDecimal:7f915c084018,'0.211E1',18(27)>,
"address1"=>#<BigDecimal:7f915c0a4430,'0.87161E1',18(36)>,
"address2"=>#<BigDecimal:7f915c0943f0,'0.84811E1',18(36)>,
"address3"=>#<BigDecimal:7f915c0a43e0,'0.385E0',9(18)>,
"address6"=>#<BigDecimal:7f915c09ebe8,'0.1E1',9(18)>,
"address7"=>#<BigDecimal:7f915c09eb98,'0.1E1',9(18)>,
"address8"=>#<BigDecimal:7f915c09d428,'0.15E1',18(18)>,
"address9"=>#<BigDecimal:7f915c09d3d8,'0.15E1',18(18)>,
"address10"=>#<BigDecimal:7f915c0a7540,'0.132E1',18(36)>,
"address11"=>#<BigDecimal:7f915c0af8a8,'0.392E1',18(36)>,
"address12"=>#<BigDecimal:7f915c0a4980,'0.14E1',18(36)>,
"address13"=>#<BigDecimal:7f915c0af858,'0.2133333333 3333333333 33333334E1',36(54)>,
"address14"=>#<BigDecimal:7f915c0a54c0,'0.3533333333 3333333333 33333334E1',36(45)>,
"address15"=>#<BigDecimal:7f915c0a66e0,'0.1533333333 3333333333 33333334E1',36(36)>,
"sdfds"=>#<BigDecimal:7f915c0a6118,'0.1E0',9(27)>,
"sf"=>#<BigDecimal:7f915c0a6028,'0.1E0',9(27)>,
"address20"=>#<BigDecimal:7f915c0ae688,'0.3E0',9(18)>,
"address21"=>#<BigDecimal:7f915c0ae638,'0.3E0',9(18)>,
"address23"=>#<BigDecimal:7f915c0ae070,'0.1E0',9(27)>,
"address22"=>#<BigDecimal:7f915c0adf80,'0.1E0',9(27)>,
"add1"=>#<BigDecimal:7f915c0ad328,'0.1E0',9(18)>,
"add2"=>#<BigDecimal:7f915c0ad2d8,'0.1E0',9(18)>,
"addx"=>#<BigDecimal:7f915c0acd10,'0.1E0',9(27)>,
"addy"=>#<BigDecimal:7f915c0acc20,'0.1E0',9(27)>}
and the distributions:
distributions =
{"1444"=>#<BigDecimal:7f915a9068f0,'0.502569E2',18(63)>,
"alice"=>#<BigDecimal:7f915a8f44e8,'0.211E1',18(27)>,
"address1"=>#<BigDecimal:7f915a906800,'0.87161E1',18(54)>,
"address2"=>#<BigDecimal:7f915a906710,'0.84811E1',18(54)>,
"address3"=>#<BigDecimal:7f915a906620,'0.385E0',9(36)>,
"address6"=>#<BigDecimal:7f915a8fdea8,'0.1E1',9(27)>,
"address7"=>#<BigDecimal:7f915a8fddb8,'0.1E1',9(27)>,
"address8"=>#<BigDecimal:7f915a8fd5e8,'0.15E1',18(18)>,
"address9"=>#<BigDecimal:7f915a8fd4f8,'0.15E1',18(18)>,
"address10"=>#<BigDecimal:7f915a8fc9b8,'0.132E1',18(36)>,
"address11"=>#<BigDecimal:7f915a9071b0,'0.3920000000 0000003E1',27(45)>,
"address12"=>#<BigDecimal:7f915a907660,'0.1400000000 0000001E1',27(36)>,
"address13"=>#<BigDecimal:7f915a9070c0,'0.2133333333 3333337E1',27(45)>,
"address14"=>#<BigDecimal:7f915a906530,'0.3533333333 3333333333 33333334E1',36(54)>,
"address15"=>#<BigDecimal:7f915a8fc148,'0.1533333333 3333334E1',27(27)>,
"sdfds"=>#<BigDecimal:7f915a907f98,'0.1E0',9(27)>,
"sf"=>#<BigDecimal:7f915a907e08,'0.1E0',9(27)>,
"address20"=>#<BigDecimal:7f915a906ad0,'0.3000000000 0000003E0',18(27)>,
"address21"=>#<BigDecimal:7f915a9069e0,'0.3000000000 0000003E0',18(27)>,
"address23"=>#<BigDecimal:7f915a9063c8,'0.1E0',9(27)>,
"address22"=>#<BigDecimal:7f915a906238,'0.1E0',9(27)>,
"add1"=>#<BigDecimal:7f915a9060a8,'0.5E-1',9(27)>,
"add2"=>#<BigDecimal:7f915a905f18,'0.1E0',9(27)>}
Ideally, I want my liabilities to look like this:
{"add1"=>0.05,
"addx"=>0.1>,
"addy"=>0.1>}
But they look like this:
{"1444"=>0.0,
"alice"=>0.0,
"address1"=>0.0,
"address2"=>0.0,
"address3"=>0.0,
"address6"=>0.0,
"address7"=>0.0,
"address8"=>0.0,
"address9"=>0.0,
"address10"=>0.0,
"address11"=>-3.0e-16,
"address12"=>-1.0e-16,
"address13"=>-3.66666666666e-16,
"address14"=>0.0,
"address15"=>-6.6666666666e-17,
"sdfds"=>0.0,
"sf"=>0.0,
"address20"=>-3.0e-17,
"address21"=>-3.0e-17,
"address23"=>0.0,
"address22"=>0.0,
"add1"=>0.05,
"add2"=>0.0,
"addx"=>#<BigDecimal:7f915c0acd10,'0.1E0',9(27)>,
"addy"=>#<BigDecimal:7f915c0acc20,'0.1E0',9(27)>}
I don't want to include -3.66666666666e-16 because that's essentially 0 and even Ruby accounts for it this way when you run -3.66666666666e-16 > 0... it returns false.
This is what I have... is the a better way? The code below is calculating the liabilities by subtracting con from dis and only selecting the liabilities that are greater than 0.0...that makes sense to me and it excludes 1-time grants of coins (there must be a matching contribution to be a liability). Then, I convert everything to floats so it's readable. Does this look right?
liab = #contributions.merge(#distributions) do |key, con, dis|
(con - dis)
end.select { |addr, amount| amount > 0.0 && #contributions.keys.include?(addr) }
liab.merge(liab) do |k, old, new|
k = new.to_f
end
I want the amount returned in float format, not the big decimal object. Is what I'm doing okay? Will I keep accuracy if I convert to float at the end?

Error in setting max features parameter in Isolation Forest algorithm using sklearn

I'm trying to train a dataset with 357 features using Isolation Forest sklearn implementation. I can successfully train and get results when the max features variable is set to 1.0 (the default value).
However when max features is set to 2, it gives the following error:
ValueError: Number of features of the model must match the input.
Model n_features is 2 and input n_features is 357
It also gives the same error when the feature count is 1 (int) and not 1.0 (float).
How I understood was that when the feature count is 2 (int), two features should be considered in creating each tree. Is this wrong? How can I change the max features parameter?
The code is as follows:
from sklearn.ensemble.iforest import IsolationForest
def isolation_forest_imp(dataset):
estimators = 10
samples = 100
features = 2
contamination = 0.1
bootstrap = False
random_state = None
verbosity = 0
estimator = IsolationForest(n_estimators=estimators, max_samples=samples, contamination=contamination,
max_features=features,
bootstrap=boostrap, random_state=random_state, verbose=verbosity)
model = estimator.fit(dataset)
In the documentation it states:
max_features : int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
- If int, then draw `max_features` features.
- If float, then draw `max_features * X.shape[1]` features.
So, 2 should mean take two features and 1.0 should mean take all of the features, 0.5 take half and so on, from what I understand.
I think this could be a bug, since, taking a look in IsolationForest's fit:
# Isolation Forest inherits from BaseBagging
# and when _fit is called, BaseBagging takes care of the features correctly
super(IsolationForest, self)._fit(X, y, max_samples,
max_depth=max_depth,
sample_weight=sample_weight)
# however, when after _fit the decision_function is called using X - the whole sample - not taking into account the max_features
self.threshold_ = -sp.stats.scoreatpercentile(
-self.decision_function(X), 100. * (1. - self.contamination))
then:
# when the decision function _validate_X_predict is called, with X unmodified,
# it calls the base estimator's (dt) _validate_X_predict with the whole X
X = self.estimators_[0]._validate_X_predict(X, check_input=True)
...
# from tree.py:
def _validate_X_predict(self, X, check_input):
"""Validate X whenever one tries to predict, apply, predict_proba"""
if self.tree_ is None:
raise NotFittedError("Estimator not fitted, "
"call `fit` before exploiting the model.")
if check_input:
X = check_array(X, dtype=DTYPE, accept_sparse="csr")
if issparse(X) and (X.indices.dtype != np.intc or
X.indptr.dtype != np.intc):
raise ValueError("No support for np.int64 index based "
"sparse matrices")
# so, this check fails because X is the original X, not with the max_features applied
n_features = X.shape[1]
if self.n_features_ != n_features:
raise ValueError("Number of features of the model must "
"match the input. Model n_features is %s and "
"input n_features is %s "
% (self.n_features_, n_features))
return X
So, I am not sure on how you can handle this. Maybe figure out the percentage that leads to just the two features you need - even though I am not sure it'll work as expected.
Note: I am using scikit-learn v.0.18
Edit: as #Vivek Kumar commented this is an issue and upgrading to 0.20 should do the trick.

U2 Universe Update Multi value field errror

I am using the Universe U2.net toolkit to update the record in universe database. We have so far no issue with update to non multi value field with the following code
Open_Again:
Try
db_connectionU2 = openConnU2()
db_connectionU2.Open()
Catch ex As Exception
GoTo Open_Again
End Try
Dim cmdWIP As New U2Command
'cmdWIP = New U2Command("DELETE FROM MPS", db_connectionU2)
cmdWIP = New U2Command("UPDATE POH SET EPOS=#FLAG where PONO='C11447'", db_connectionU2)
cmdWIP = New U2Command("UPDATE CURCVRD F8=#F8 where F0='51747*1'", db_connectionU2)
cmdWIP.Parameters.Add(New U2Parameter("#F8", U2Type.VarChar)).Value = "t"
cmdWIP.Connection = db_connectionU2
cmdWIP.ExecuteNonQuery()
cmdWIP.Dispose()
cmdWIP = Nothing
db_connectionU2.Close()
db_connectionU2.Dispose()
db_connectionU2 = Nothing
but it having the problem when we try to add in to multivalue field. It's return the error " Column being update from single to multi is illegal. Please see the red box for the message and the value we are writing in.
Please click below to see the screenshot
enter image description here
Thank you
You need to look at the DICT of that file and make sure your entries are marked and MultiValued and have an Multi-Value Association.
Here is an example from the HS.SALES demo account.
>LIST DICT CUSTOMER
DICT CUSTOMER 03:56:47pm 01 Dec 2016 Page 1
Type &
Field......... Field. Field........ Conversion.. Column......... Output Depth &
Name.......... Number Definition... Code........ Heading........ Format Assoc..
CUSTID D 0 P(0N) Customer ID 10R S
#ID D 0 CUSTOMER 10L S
SAL D 1 Salutation 5T S
FNAME D 2 First Name 12T S
LNAME D 3 Last Name 16T S
COMPANY D 4 Company Name 20T S
ADDR1 D 5 Address line 1 30T S
ADDR2 D 6 Address line 2 30T S
CITY D 7 City 12T S
STATE D 8 P(2A) State 2L S
MCU
ZIP D 9 P(5N) Zip 5L S
PHONE D 10 P("("3N")"3N Telephone 13R S
-4N)
PRODID D 11 P(1A4N) Product 5L M ORDER
S
SER_NUM D 12 P(6N) Serial# 6L M ORDER
S
Notice how PRODID has "M ORDERS" after is (the is drops to the next line thanks to the 80 char size of my terminal. This tells Universe that it is a multivalued field with an Association called ORDERS. This allows the SQL interpreter to know how to update things.
It gets a bit more complicated and I would recommend looking up HS.ADMIN and specifically HS.SCRIB for tips on formatting things for non-pick style consumption. Check the UVodbc guide for more info on that.

Match Multiple Patterns in a String and Return Matches as Hash

I'm working with some log files, trying to extract pieces of data.
Here's an example of a file which, for the purposes of testing, I'm loading into a variable named sample. NOTE: The column layout of the log files is not guaranteed to be consistent from one file to the next.
sample = "test script result
Load for five secs: 70%/50%; one minute: 53%; five minutes: 49%
Time source is NTP, 23:25:12.829 UTC Wed Jun 11 2014
D
MAC Address IP Address MAC RxPwr Timing I
State (dBmv) Offset P
0000.955c.5a50 192.168.0.1 online(pt) 0.00 5522 N
338c.4f90.2794 10.10.0.1 online(pt) 0.00 3661 N
990a.cb24.71dc 127.0.0.1 online(pt) -0.50 4645 N
778c.4fc8.7307 192.168.1.1 online(pt) 0.00 3960 N
"
Right now, I'm just looking for IPv4 and MAC address; eventually the search will need to include more patterns. To accomplish this, I'm using two regular expressions and passing them to Regexp.union
patterns = Regexp.union(/(?<mac_address>\h{4}\.\h{4}\.\h{4})/, /(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/)
As you can see, I'm using named groups to identify the matches.
The result I'm trying to achieve is a Hash. The key should equal the capture group name, and the value should equal what was matched by the regular expression.
Example:
{"mac_address"=>"0000.955c.5a50", "ip_address"=>"192.168.0.1"}
{"mac_address"=>"338c.4f90.2794", "ip_address"=>"10.10.0.1"}
{"mac_address"=>"990a.cb24.71dc", "ip_address"=>"127.0.0.1"}
{"mac_address"=>"778c.4fc8.7307", "ip_address"=>"192.168.1.1"}
Here's what I've come up with so far:
sample.split(/\r?\n/).each do |line|
hashes = []
line.split(/\s+/).each do |val|
match = val.match(patterns)
if match
hashes << Hash[match.names.zip(match.captures)].delete_if { |k,v| v.nil? }
end
end
results = hashes.reduce({}) { |r,h| h.each {|k,v| r[k] = v}; r }
puts results if results.length > 0
end
I feel like there should be a more "elegant" way to do this. My chief concern, though, is performance.

Resources