How do I use the balance classes option for AutoML in the Flow interface? - h2o

I'm trying to use AutoML in the Flow interface for a classification problem.
My response column is an enum data type with values of 1 and 0.
My data set is really imbalanced: around 0.5% of rows have a 1 response.
I want to try the balance classes option, but every time I try it, the program ends up throwing errors.
If I check the balance classes option, am I required to also input values in the class_sampling_factors input box? If so, what do I put in?
The documentation says:
"class_sampling_factors: (DRF, GBM, DL, Naive-Bayes, AutoML) Specify the per-class (in lexicographical order) over/under-sampling ratios. By default, these ratios are automatically computed during training to obtain the class balance. This option is only applicable for classification problems and when balance_classes is enabled."
But it seems like the function fails to run unless I put something in.
I've tried putting in 200.0, 1 and also 1.0,200.0 but neither seemed to work well.

You are not required to specify the "Class sampling factors" parameter when using "Balance classes".
I just verified on H2O 3.26.0.9 that you can successfully run AutoML with "Balance classes" checked and "Class sampling factors" left blank, using the HIGGS dataset (10k subset). I also entered 1.0,0.5 for "Class sampling factors" and that worked as well. I don't see any bugs reported on older versions of H2O (not sure which version you are using), so maybe the error is caused by something else?
Here's the Flow output generated by both options:
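If you want to reproduce the same check from the h2o Python client instead of Flow, a minimal sketch looks roughly like this (the file path and column name are placeholders, and it assumes a locally running H2O cluster):

import h2o
from h2o.automl import H2OAutoML

h2o.init()  # connect to (or start) a local H2O cluster

# Hypothetical training file; the response must be a factor for classification
train = h2o.import_file("higgs_10k.csv")
train["response"] = train["response"].asfactor()

# balance_classes on its own is enough; class_sampling_factors is optional and,
# when omitted, H2O computes the over/under-sampling ratios automatically
aml = H2OAutoML(max_models=5, seed=1, balance_classes=True)
# aml = H2OAutoML(max_models=5, seed=1, balance_classes=True,
#                 class_sampling_factors=[1.0, 0.5])  # explicit ratios, if desired

aml.train(y="response", training_frame=train)
print(aml.leaderboard)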

Related

Partial KENDALL'S TAU-B (τb) correlation with bootstrapping

Is it possible to use partial KENDALL'S TAU-B (τb) correlation with bootstrapping?
I used SPSS syntax to run a partial KENDALL'S TAU-B (τb) correlation. I followed the steps on this website (https://toptipbio.com/spearman-partial-correlation-spss/) to run TAU-B without bootstrapping (by using 'TAUB'='CORR' instead of 'RHO'='CORR'). But I don't know how to get a bootstrapping result.
Because my sample size can be quite small (some groups can be as small as 8), I am not very confident running KENDALL'S TAU-B without bootstrapping. (I am quite new to statistics. Maybe I am wrong about bootstrapping; please correct me if I am.)
Thank you so much!!!
P.S. I tried to run bootstrapping before the 'matrix out' command, and SPSS generated 2000 results (I set 2000 resamples for bootstrapping). After I ran the 'matrix in' command, SPSS warned me that it could not split the data the way the 'matrix out' data set did. (This might be a bit confusing: the SPSS warning was not written in English, and I'm sorry I can't give a proper translation.)
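For what it's worth, the bootstrapping idea itself is simple to sketch outside SPSS. This illustrative Python version (not SPSS syntax, and not the partial, covariate-adjusted form from the linked guide) just resamples pairs and recomputes tau-b; the data here are simulated stand-ins:

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(42)
x = rng.normal(size=30)
y = x + rng.normal(size=30)

n_boot = 2000
taus = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, len(x), len(x))    # resample pairs with replacement
    taus[i], _ = kendalltau(x[idx], y[idx])  # kendalltau computes tau-b by default

tau_obs, _ = kendalltau(x, y)
ci_low, ci_high = np.percentile(taus, [2.5, 97.5])
print(f"tau-b = {tau_obs:.3f}, 95% bootstrap CI = [{ci_low:.3f}, {ci_high:.3f}]")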

Cross Validation--Use testing set or validation set to predict?

I have a question about cross validation.
In machine learning, we know there are training, validation, and test sets.
The test set is the final run to see how the final model/classifier performs.
But in the process of cross validation:
we are splitting the data into a training set and a testing set (most tutorials use this term), so I'm confused. Do we need to split the whole data into 3 parts: training, validation, test? In cross validation we just keep talking about the relationship between 2 sets: training and the other.
Could someone help clarify?
Thanks
Yep, it's a little confusing, as some material uses CV/test interchangeably and some material does not, but I'll try to make it easy to understand by explaining why each set is needed:
You need the train set to do exactly that, train. But you also need a way to ensure that your algorithm isn't memorizing the train set (that it's not overfitting) and to see how well it's doing. That creates the need for a test set: you give the model data it has never seen and measure the performance.
But... ML is all about experimentation. You will train, evaluate, tweak some knob (hyperparameters or architectures), train again, evaluate again, over and over, and then select the best experiment results. You deploy your system, in production it gets data it has never seen, and it doesn't perform that well. What happened? You used your test data to fit parameters and make decisions, so you overfitted to the test data, but you don't know how the model does on data it has truly never seen.
Cross validation solves this. You have your train data to learn parameters and test data to evaluate how the model does on unseen data, but you still need a way to experiment with hyperparameters and architectures: you take a sample of your training data and call it the cross validation set, and you hide your test data; you will NEVER use it until the end.
Now use your train data to learn parameters and experiment with hyperparameters and architectures, but evaluate each experiment on the cross validation data instead of the test data (you can see it as using the CV data to learn the hyperparameters). After you have experimented a lot and selected your best performing option (on CV), you use your test data to evaluate how it performs on data it has never seen, before deploying it to production.
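A minimal sketch of that workflow with scikit-learn (assumed to be available; the dataset and hyperparameter grid are just illustrative): hold out a test set, use cross-validation on the training portion to pick hyperparameters, and touch the test set exactly once at the end.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:                         # hyperparameter experiments
    scores = cross_val_score(LogisticRegression(C=C, max_iter=5000),
                             X_train, y_train, cv=5)     # evaluated on CV folds only
    if scores.mean() > best_score:
        best_C, best_score = C, scores.mean()

final_model = LogisticRegression(C=best_C, max_iter=5000).fit(X_train, y_train)
print("held-out test accuracy:", final_model.score(X_test, y_test))  # used once, at the end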
This is generally an either-or choice. The process of cross-validation is, by design, another way to validate the model. You don't need a separate validation set -- the interactions of the various train-test partitions replace the need for a validation set.
Think about the name, cross-validation ... :-)

Global mapping of one subscript dimension to another database

I have a vendor defined database (about 140GB total) on Caché 2007. It uses the old style MUMPS programming environment and accesses globals directly in a hierarchical style. There is one global that accounts for about 75% of the total database size. The first subscript in this table is an artificial integer account number. The next 2-3 subscripts are constant subrecord identifiers that break up blocks of fields and denote repeating subrecord kinds.
One of these repeating subrecords (record type 30) is for notes on an account. Because of the way the system is used, this dimension accounts for a very large portion of the global's total space; I'd estimate it to be at least 50%. Because of the way Caché stores data physically in the database, a scan of this global ends up loading all or most of these notes as a side effect even though they aren't relevant to most operations. It has the effect of greatly increasing the cost of IO operations on the global, especially when you only want one tiny detail from a bunch of accounts.
Example subscript references for this global:
^ACCT(3461,10,1)="SOME^DATA"
^ACCT(3461,10,2)="MORE^DATA"
...
^ACCT(3461,30,1)="NOTE1 blah blah"
^ACCT(3461,30,2)="NOTE2 blah blah"
...
^ACCT(3461,30,100)="NOTE100 blah blah"
I can't change the design of the database. It's controlled by an outside vendor and there is a large amount of MUMPS style hardcoded references in the database. I'm thinking that a big reason batch operations are so slow on the system is the high cost of these mostly irrelevant notes coming along for the IO ride whenever account data is accessed. Scanning this whole global (i.e. when there is no useful application maintained index) takes at least 8 hours.
One thought I had is to shift the note data from being stored alongside other details in the global to a separate database file by using the global mapping facility described in the Guide to Using Caché Globals and Guide to System Administration. If I could map all the subscript 30s to a separate database file in the same Caché database, most data operations (the ones that don't even care about notes) wouldn't be bringing those into memory along with the details they do care about.
In the global structure guide (1st link), this looks plausible, as they show a particular 2nd subscript being mapped separately from the 1st subscript. What they don't show in any of the examples is the syntax to make that happen. In the "Add a new global mapping" screen in the Caché Management Portal, I should be able to do something like
Global name: ACCT
Subscripts to be mapped: (BEGIN:END)(30)
But whatever variations I try in the syntax, I always get ERROR #657: Invalid subscript in reference 1 subscript #1.
StackExchange note: This question would possibly be better suited to dba.stackexchange.com but there are apparently zero Intersystems questions there and I don't think it would get any attention.
Unfortunately, while it's possible to map 2nd level subscripts of a particular node, it's not possible to map 2nd level subscripts of all nodes.
There is an experienced Performance team at WRC; did you try to contact them?

Practices for allowing systems to accommodate human error?

Systems have to sometimes accommodate the possibility of real world bad data. Consider that some data originates with paper forms. And forms inherently have a limited means of validating data.
Example 1: On one form users are expected to enter an integer distance (in miles) into a blank. We capture the information as written as a string since we don't always end up getting integer values.
Example 2: On another form we capture a code. That code should map to one of the codes in our system. However, sometimes the code written on the form is incorrect. We capture the code and allow it to exist with an invalid value until some future time of resolution. That is, we temporarily allow bad data since it's important to record the record even if some of it is invalid.
I'm interested in learning more about how systems accommodate bad data, that is, human error. Databases are supposed to be bastions of data integrity, but the real world is messy and people make mistakes. Systems must allow us to reflect those mistakes.
What are some ways systems you've developed accommodate human error? What practices have you used? What lessons have you learned?
Any further reading on the topic? (I had trouble Googling it.)
I agree with you, whatever we do there's no guarantee that we can get rid of bad or incorrect data. Especially, but not only, if it comes to user input. In my experience the same problems exist in complex integration projects, in which you have to integrate and merge (often inconsistent) data retrieved from different systems.
A good strategy is to decouple the input from the operational system itself. First, place user (or external system) provided data in a separate datastore (e.g. a different schema). In a second step, load this data into your operational datastore, but only if it conforms to strict rules (e.g. use address verification software to verify a given address). This Extract, Transform, Load (ETL) approach is fairly common in Data Warehousing (DWH) solutions, but can be applied programmatically in transactional systems as well (in my experience).
The above approach often leads to asynchronous processes in which the input is submitted first and (maybe) at a later time the external entity (user or system) retrieves feedback on whether its data was correct or not.
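As a rough, hypothetical sketch of that "staging first, promote later" idea (here with SQLite and a made-up orders table): raw input is always recorded, only rows passing the strict rules are promoted to the operational table, and the rest stay flagged in staging for later resolution.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (id INTEGER PRIMARY KEY, distance_miles TEXT, status TEXT);
    CREATE TABLE orders         (id INTEGER PRIMARY KEY, distance_miles INTEGER);
""")

def submit(raw_distance):
    """Step 1: accept whatever was written on the form."""
    conn.execute("INSERT INTO staging_orders (distance_miles, status) VALUES (?, 'pending')",
                 (raw_distance,))

def promote():
    """Step 2: load only rows that pass the strict rules; flag the rest."""
    rows = conn.execute(
        "SELECT id, distance_miles FROM staging_orders WHERE status = 'pending'").fetchall()
    for row_id, raw in rows:
        try:
            conn.execute("INSERT INTO orders (distance_miles) VALUES (?)", (int(raw),))
            status = "loaded"
        except ValueError:
            status = "rejected"   # kept for human review, not lost
        conn.execute("UPDATE staging_orders SET status = ? WHERE id = ?", (status, row_id))

submit("12"); submit("about ten")
promote()
print(conn.execute("SELECT * FROM staging_orders").fetchall())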
EDIT: For further reading I recommend having a look at DWH concepts. Although you may not want to build such a thing, you could partially apply those concepts:
http://en.wikipedia.org/wiki/Extract,_transform,_load
http://en.wikipedia.org/wiki/Data_warehouse
http://en.wikipedia.org/wiki/Data_cleansing
A government department I worked in does a lot of surveys, most of which are (were) still paper based.
All the results were OCR'd into the system.
As part of the OCR process a digital scan of the forms is kept.
Data is then validated, data that is undecipherable or which fails validation is flagged.
When a human operator reviews the digital data they can modify the data if they are confident that they can correctly interpret what the code could not; they (here's the cool bit) can also bring up the scan of the paper based original, and use that to determine what the user was trying to say.
On a different note: at some point you want to validate the data coming in against any expected data ranges that you want it to conform to; by rejecting it at the point of entry you give the user a chance to correct it. The trade-off is that every time you reject it you increase the chance of them abandoning the whole process.
At some point in your system you need to specify the rules which will be used for validation. At the end of the day a system is only going to be as smart as those rules. You can develop these yourself into the code (probably the business logic) or you might use a 3rd party component.
Having flexible control over the validation is pretty important, as the rules are likely to change over time.
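For instance, a tiny (hypothetical) sketch of keeping the rules as data rather than hard-coding them, so they stay easy to change or load from configuration:

# Each rule is (field, predicate, message); swapping rules never touches the business logic.
RULES = [
    ("distance_miles", lambda v: v.strip().isdigit(), "must be a whole number of miles"),
    ("code",           lambda v: v in {"A1", "B2", "C3"}, "unknown code, needs review"),
]

def validate(record):
    """Return a list of (field, message) problems; an empty list means the record is clean."""
    problems = []
    for field, check, message in RULES:
        value = record.get(field, "")
        if not check(value):
            problems.append((field, message))
    return problems

print(validate({"distance_miles": "12", "code": "A1"}))    # []
print(validate({"distance_miles": "ten", "code": "ZZ"}))   # two flagged problems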
To be honest with you, one point of migrating from paper-based systems to IT is to remove these errors and make sure all data is always correct. I doubt any correctly planned and developed IT system (especially business financial systems) would allow such errors. Not in the company I am working for anyway...
There are lots of software tools that address the kinds of problems you mention. There are platforms and tools that let you define rules for scrubbing and transforming data and handling validation errors. Those techniques are widely used for Data Integration and Business Intelligence applications. Google for "Data Quality" or "Data Integration".
The easiest thing to do (though it's not always possible) is to design the interface where users enter the data to limit, as much as possible, the amount of free text they need to enter. In my experience this seems to be where a lot of problems come from. One simple example is to provide a select or auto-complete select field.
One thing that you could do is do everything possible to determine if the data is correct before going into the db. I try to give the user entering the data as much feedback as possible so they can (ideally) fix some of the issues before the data gets persisted. For example, it is a very quick check to determine if the data being entered is of the correct type.
I got started in legal systems before the PC era. Litigation support databases routinely have to accommodate factually incorrect, incomplete, and contradictory information. It takes a different way of thinking.
The short version . . .
Instead of recording a single fact, you record multiple assertions about a fact. It boils down to designing a database to store data from assertions like these.
In an interview at 2011-01-03 08:13, Neil Rimes told Officer Cane
that he was at home from 2011-01-02 20:00 until 2011-01-03 08:13.
In an interview at 2011-01-03 08:25, Liza Nevers told Officer Cane
that Neil Rimes came home at 2011-01-02 23:45.
In a deposition at 2011-05-13 10:22, Cody Maxon told attorney Kurt
Schlagel that he saw Neil Rimes at Kroger at 2011-01-03 03:00
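One minimal way to model those assertions, sketched here as an illustrative (hypothetical) structure rather than a real litigation-support schema: every statement keeps its source, when it was made, and what it claims, and contradictory claims simply coexist until someone resolves them.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Assertion:
    stated_at: datetime    # when the statement was made (interview, deposition, ...)
    source: str            # who made the statement
    recorded_by: str       # who captured it
    subject: str           # who/what the statement is about
    claim: str             # the content, kept verbatim

assertions = [
    Assertion(datetime(2011, 1, 3, 8, 13), "Neil Rimes", "Officer Cane",
              "Neil Rimes", "at home from 2011-01-02 20:00 until 2011-01-03 08:13"),
    Assertion(datetime(2011, 1, 3, 8, 25), "Liza Nevers", "Officer Cane",
              "Neil Rimes", "came home at 2011-01-02 23:45"),
    Assertion(datetime(2011, 5, 13, 10, 22), "Cody Maxon", "Kurt Schlagel",
              "Neil Rimes", "seen at Kroger at 2011-01-03 03:00"),
]

# Contradictory claims about the same subject are all preserved for later weighing.
for a in assertions:
    print(f"{a.stated_at}: {a.source} -> {a.claim}")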

Is avoiding the T in ETL possible?

ETL is pretty commonplace. Data is out there somewhere so you go get it. After you get it, it's probably in a weird format, so you transform it into something and then load it somewhere. The only problem I see with this method is you have to write the transform rules. Of course, I can't think of anything better. I suppose you could load whatever you get into a blob (SQL) or into an object/document (non-SQL), but then I think you're just delaying the parsing. Eventually you'll have to parse it into something structured (assuming you want to). So is there anything better? Does it have a name? Does this problem have a name?
Example
Ok, let me give you an example. I've got a printer, an ATM and a voicemail system. They're all network enabled or I can give you connectivity. How would you collect the state from all these devices? For example, the printer dumps a text file when you type status over port 9000:
> status
===============
has_paper:true
jobs:0
ink:low
The ATM has a CLI after you connect on port whatever and you can type individual commands to get different values:
maint-mode> GET BILLS_1
[$1 bills]: 7
maint-mode> GET BILLS_5
[$5 bills]: 2
etc ...
The voicemail system requires certain key sequences to get any kind of information over a network port:
telnet> 7,9*
0 new messages
telnet> 7,0*
2 total messages
My thoughts
Printer - So this is pretty straightforward. You can just capture everything after sending "status", split on lines and then split on colons or something. Pretty easy. It's almost like getting a crap-formatted result from a web service or something. I could avoid parsing and just dump the whole conversation from port 9000, but eventually I'll want to get rid of that equals-signs line. It doesn't really mean anything.
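As a rough sketch of that "split on lines, then on colons" idea (the canned reply below stands in for the real port 9000 conversation):

raw_reply = """===============
has_paper:true
jobs:0
ink:low"""

def parse_printer_status(text):
    status = {}
    for line in text.splitlines():
        if ":" not in line:          # drop decoration like the equals-signs line
            continue
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status

print(parse_printer_status(raw_reply))   # {'has_paper': 'true', 'jobs': '0', 'ink': 'low'}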
ATM - So this is a bit more of a pain because it's interactive. Now I'm approaching expect or protocol territory. It'd be better if they had a service I could query for these values, but that's out of scope for this post. So I write a client that gets all the values. But now if I want to collect all the data, I have to define what all the questions are. For example, I know that the ATM has more bills than $1 and $5, so I'd have a complete list like "BILLS_1 BILLS_5 BILLS_10 BILLS_20". If I ask all the questions then I have an inventory of the ATM. Of course, I still have to parse out the results and clean up the text if I want to figure out how much money is left in the ATM. So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Voicemail - This is similar to the ATM machine where it's interactive. It's just a bit weirder because the key sequences/commands aren't "get key". But essentially it's the same problem and solution.
Future Proof
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster. Or anything? You'd have to write "connectors" ahead of time or write a parser afterwards against some raw field you stored earlier. Maybe in the case of these very limited examples there's no alternative. There's no way to future-proof. You just have to understand the new device and parse it at collection or parse it after the fact (your stored blob/object/document).
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer that simply requires the device to split out lines. Then you could have a text processing piece that parses based on rules. For the ATM device, you'd have to write something that "speaks ATM" and turns it into lines which the iterator would then take care of. At this point, hopefully you'd be able to say "I can handle anything that has lines of text".
But then what will you call these rules for parsing the text? "Printer rules" might as well be called "printer parser" which is the same to me as "printer transform". Is there a better term for all of this?
I apologize for this question being so open ended. :)
When your sources of information are as disparate as what you illustrate, you have no choice but to implement the Transform in order to bring the items into a common data repository. Usually your data sources won't be this extreme; the data will all be related in some way, but you may be retrieving it from different sources (some might come from a nicely structured database, some more might come from an Excel or XML or text file, some more might come from a web service call, etc.).
When coding up a custom ETL application, a common pattern to use is the Provider model. This enables you to write a whole bunch of custom providers to load/query and then transform the data. All the providers implement a common interface with some relatively common function definitions (for example QueryData(), TransformData()), but the implementation of those methods will be wildly different depending on the data source being dealt with; the interface just gives a common way to deal with all the different providers. You can then use an XML configuration file to dictate which providers to run and any other initial settings they may require. Tools like SSIS abstract this stuff away for you by giving you a nice visual designer, but you can still get down and dirty and write your own code which it calls.
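A bare-bones sketch of that provider pattern in Python (the QueryData/TransformData names come from the paragraph above, rendered as snake_case; everything else, including the canned printer reply, is hypothetical): each device gets its own provider, and the ETL driver only ever talks to the common interface.

from abc import ABC, abstractmethod

class Provider(ABC):
    @abstractmethod
    def query_data(self) -> str: ...                   # pull raw data from the source

    @abstractmethod
    def transform_data(self, raw: str) -> dict: ...    # normalize it into a common shape

class PrinterProvider(Provider):
    def query_data(self) -> str:
        # real code would talk to port 9000; here we return a canned reply
        return "has_paper:true\njobs:0\nink:low"

    def transform_data(self, raw: str) -> dict:
        return dict(line.split(":", 1) for line in raw.splitlines() if ":" in line)

def run_etl(providers):
    for p in providers:                                # the driver is provider-agnostic
        print(type(p).__name__, p.transform_data(p.query_data()))

run_etl([PrinterProvider()])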
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster.
No problem, I would just write a new provider, which can sit in its very own assembly (DLL), so it can be shipped (or modified, upgraded, etc.) in isolation from any other providers I already have. Or if I was using SSIS, then I would write a new DTS package.
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer ... Then you could have a text processing piece that parses based on rules.
Absolutely - you can have a base class containing common functionality which several different providers can inherit from, and each provider can use its own set of rules, which could be coded into it or contained in an external configuration file.
So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Use whichever approach makes sense for the data you are grabbing. It is also quite common for an ETL process to dump its data into a staging area (like some staging tables in a database) while the data is all being aggregated and accumulated, and then further process it to link related data and perform calculations. In the case of your ATM it may not be necessary to calculate a cash balance at ETL time because you can easily calculate it at any time in the future.
