reducing CRFClassifier model file size - stanford-nlp

I am training a chunker using CoreNLP's CRFClassifier and I would like to reduce the size of the generated model file. I thought that I could use the featureCountThreshold property to threshold uncommon features and in this way reduce the file size, but I have tried several thresholds and the file size is always the same, so either I am doing something wrong or I misunderstood the featureCountThreshold property.
This is how I instantiate the CRFClassifier:
val props = new Properties()
props.setProperty("macro", "true")
props.setProperty("featureFactory", "edu.arizona.sista.chunker.ChunkingFeatureFactory")
props.setProperty("featureCountThreshold", "10")
new CRFClassifier[CoreLabel](props)
The code is in Scala, but it should be straightforward to follow.
Is this the right way to reduce the file size? And if not, is there a way to accomplish this?

For the next person trying to do this:
There are two properties with similar names in CoreNLP: featureCountThreshold and featureCountThresh. featureCountThresh is the correct one for this task.
We were able to reduce a model from 321M to 54M using a featureCountThresh of 10 and still retain almost the same performance.

Related

What is the default target metric that H2O models use for their predict() method? Can it be changed?

I am using an H2ORandomForestEstimator. What is the default target metric that H2O models use for their predict() method?
https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#h2o.automl.H2OAutoML.predict
Is there a way to set this (e.g., to use one of the other metric-maximizing thresholds that can be seen in the results of the get_params() method)?
Currently I am doing something like this:
df_preds = mymodel.predict(df)
activation_threshold = mymodel.find_threshold_by_max_metric('f1', valid=True)
# adjust the predicted label for the desired metric's maximizing threshold
df_preds['predict'] = df_preds['my_positive_class'].apply(lambda probability: 'my_positive_class' if probability >= activation_threshold else 'my_negative_class')
See:
https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/model_categories.html?highlight=find_threshold#h2o.model.binomial.H2OBinomialModel.find_threshold_by_max_metric
https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html?highlight=apply#h2o.H2OFrame.apply
There's no concept of a "target metric" when generating predictions, since you're just predicting the response for a row of data (there's no scoring here).
Edit: Thanks for clarifying your question. If you want to change which threshold is used to turn probabilities into labels, then what you're doing above is a good solution. If you have a suggestion for a utility function that would make this more straightforward, please file a JIRA with your idea (it could definitely be improved).
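To make explicit what a "metric-maximizing threshold" means, here is a minimal, library-free sketch of the idea: try each observed probability as a cutoff, keep the one with the highest F1, then relabel the rows. The helper names (f1_at, best_f1_threshold, apply_threshold) are hypothetical and not part of H2O, and H2O's own computation inside find_threshold_by_max_metric may differ in detail.

```python
def f1_at(threshold, probs, labels):
    """F1 score when rows with probability >= threshold are labelled positive."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and not y)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(probs, labels):
    """Return the candidate cutoff (one of the observed probabilities) with the highest F1."""
    return max(sorted(set(probs)), key=lambda t: f1_at(t, probs, labels))


def apply_threshold(probs, threshold, positive, negative):
    """Relabel each row according to the chosen cutoff."""
    return [positive if p >= threshold else negative for p in probs]


# Toy validation data: predicted probability of the positive class and the true label.
probs = [0.10, 0.35, 0.40, 0.80, 0.90]
labels = [False, False, True, True, True]

cutoff = best_f1_threshold(probs, labels)
predicted = apply_threshold(probs, cutoff, "my_positive_class", "my_negative_class")
```

With H2O itself, the cutoff would come from find_threshold_by_max_metric as in the question; only the relabelling step is needed on top of predict().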

Cubism with genomic data (or non-timeseries data)

I'd like to hear your thoughts on what it would take to make cubism work with non-time-series data, concretely, genomic data.
This type of data has a locus (a chromosome and coordinates within that chromosome) instead of a timestamp:
chrm1 145678123 value
chrm12 45345 value
chrmX 4535 value
....
Which option do you think is best: hacking cubism's core to allow for this type of data (or any type of data, for that matter), or spawning a new project altogether?
UPDATE: I decided to implement a modified version of cubism for DNA. I call it DNAism and you can find it here. Take a look and let me know what you think.
-drd
Cubism is probably not the right kind of library for this task. You would have to modify the library in a pretty significant way. Instead of doing that, I'd recommend you use the d3.horizon plugin, which gives you a lot more control because you can create custom scales.
Hope this answers your question.

Can Groups be used to emulate the "class" or "struct" data structures from other languages

Is there a data structure within LiveCode that can be used as a "holder" for associated data, letting me handle it collectively? I come from a Java / JavaScript / C background, so I am looking for a Class or Struct sort of data structure.
I've found examples of Groups, which seem to have some of this functionality, but it feels a bit like I'm bending the language to meet my needs.
As a specific example, suppose I had an image field on my screen that would randomly display an image and, when pressed, play an associated sound clip. I'd expect to create a list of "structures" that contained the path to the image and the path to the associated sound clip, and use that data to populate the image field and to decide what sound clip to play.
Would a Group be the correct structure to use in this case? Or am I approaching this in a way that isn't really fitting with the way LiveCode works?
It takes a little getting used to, but the xTalk world is much simpler and more open than any ordinary procedural language. So much of what you once had to manage is no longer required.
So when splash21 said that you could store all your image and sound references in a custom property, he was really saying that the LiveCode environment contains intrinsic, high level functionality that makes these sorts of things instantly accessible, and the only thing required of you is to call for them, and they simply work.
The only way to appreciate this is to make a few simple programs, to really see what is possible. Make your application. Everything you mentioned can be accomplished with perhaps a dozen lines of code in a single handler. I recommend that you join the LiveCode use list and forums. The community is vibrant and eager to help, frequently with full-blown solutions to specific problems, but more importantly, as guides and mentors to new users.
Craig Newman
Arrays in LiveCode are actually associative arrays (like hash maps): a key is associated with a value, and the value may itself be an array.
Chapter 5.5.7 of the User's Guide says:
"Array elements may contain nested or sub-elements, making them multi-dimensional. This type of array is ideal for processing hierarchical data structures such as trees or XML. To access a sub-element, simply declare it using an additional set of square brackets."
put "ABC" into myVariable["myKeyName"]["aChildElement"]
See also: How to store pictures in a stack?
Dave, I'm hoping to get a struct-like container implemented in the near future. Meanwhile you can, as splash21 mentioned, use custom properties (or better yet, custom property sets) to do what you want. This will give you a pseudo-struct for each object, and you can store the image file and sound specifications in the properties. If you use that in conjunction with a behavior object, you'll end up very close to a real inheritable class formation.

How to create my own RecommenderJob?

I have found several tutorials on how to create my own non-distributed recommender, but none on how to create my own distributed recommender job (any link is welcome if you know one).
In the book “Mahout in Action” there are some examples of how to write Mappers/Reducers using Mahout’s objects, but it does not seem to show how to put these jobs together.
However, there is item/RecommenderJob in mahout-core, which gives an idea of how this can be done. My actual intent is to replace the first mapper so that I don't have to prepare my data outside of Mahout (lines look like "userid,itemid1,itemid2,itemid3..." and using item.RecommenderJob I obviously need lines like "itemid1,itemid2", "itemid1,itemid3", ...).
Now would it be a good idea to just copy over the RecommenderJob class and change what I need?
I have tried it, but since this class uses variables that are in package scope (e.g., UserVectorSplitterMapper.USERS_FILE) I have to replace these, which does not feel good.
Should I rather create a new class extending AbstractJob and pick out the things I need from RecommenderJob? Then what are the elements in RecommenderJob that I really need?
Your alternatives are to precede the job with your own job that translates your input into the form the job wants or, indeed, to just modify the job. I don't think it's a big deal to copy the job and customize it if you need non-trivial changes that aren't (and wouldn't make sense to be) supported as some kind of config parameter.
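The first alternative, a translation step that runs before the job, can be sketched roughly as follows. This is a plain, non-distributed Python sketch, not a Mahout or Hadoop job. It assumes the target format is one "userid,itemid" pair per line; if your job expects something else, change the emitted pairs accordingly. The function names expand_user_line and expand_lines are hypothetical. On a cluster, the same per-line logic would sit in the map function of the preceding job.

```python
def expand_user_line(line):
    """Turn 'userid,itemid1,itemid2,...' into ['userid,itemid1', 'userid,itemid2', ...]."""
    fields = [field.strip() for field in line.strip().split(",")]
    user_id, item_ids = fields[0], fields[1:]
    # Skip empty fields, e.g. from a trailing comma.
    return [f"{user_id},{item_id}" for item_id in item_ids if item_id]


def expand_lines(lines):
    """Expand every non-blank input line into its user/item pairs."""
    for line in lines:
        if line.strip():
            yield from expand_user_line(line)


# Hypothetical sample input in the format described in the question.
sample = ["u1,i1,i2,i3", "u2,i2"]
pairs = list(expand_lines(sample))
```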

Finding Duplicate or Similar Images in a specific directory in a database

I am new to this, and my objective is to build a web application that lets users store images in a database. I want to avoid storing the same image twice or more.
So what I need is a way to find duplicate or similar images that are already stored in the database or, even better, to check at import time: if the user's image is similar to one that is already stored, the system should warn the user not to store it.
I just want to learn how to find similar or duplicate images in a specific directory in a database. Can you explain how to build this from the beginning, and what I should learn to accomplish it, starting from the basics, such as a tutorial? I'd like to learn a lot, if possible.
Thanks in advance; I really need this help.
Finding similar images is much more complex, so I will stick to finding duplicate images first. The easiest thing to do is to take a SHA1 hash of the image bytes. Here is some C# code to accomplish this (see below). As for storing the hash in a database, I would recommend a binary(20) datatype to hold the result of the hash. This allows your SQL server to index and query much faster than storing the hash as a string or in some other format.
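using System;
using System.IO;
using System.Security.Cryptography;

// Returns the SHA-1 hash of (at most) the first 3,840,000 bytes of the file.
private static byte[] GetHashCodeForFile(string file)
{
    const int maxNumberOfBytesToUse = 3840000;
    using (Stream sr = File.OpenRead(file))
    using (HashAlgorithm hasher = SHA1.Create())
    {
        // Cap the buffer so that very large files are only partially hashed.
        int bytesToReadIn = (int)Math.Min(sr.Length, maxNumberOfBytesToUse);
        byte[] buffer = new byte[bytesToReadIn];
        // Stream.Read may return fewer bytes than requested, so loop until the buffer is full.
        int offset = 0;
        while (offset < bytesToReadIn)
        {
            int read = sr.Read(buffer, offset, bytesToReadIn - offset);
            if (read == 0) break; // unexpected end of file
            offset += read;
        }
        return hasher.ComputeHash(buffer, 0, offset);
    }
}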
Searching for similar images is a difficult problem and an active area of research, and the answer depends on how you define "similar". Some prominent methods for finding similar images are:
Check the metadata tags (EXIF or similar) in the image file for the creation date; similar images are often taken at nearly the same time. This may not be the best fit for what you want.
Calculate the relative histogram of both images and compare the deltas in each color channel. This has the benefit that it can be written as an SQL query and is invariant to image size, so an image that has been converted to a thumbnail will still be found.
Perform an image subtraction between the two images and see how close the result gets to pure black (all zeros). I don't know of a way to do this with a T-SQL query, and the code can get tricky with images that need to be resized.
Calculate the contours of each image (with Sobel, Canny, or another edge detector), then subtract the two results to see how many of the contours overlap. Again, I don't think this can be handled in SQL.
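As a rough illustration of the relative-histogram method from the list above, here is a minimal Python sketch. It assumes the image has already been decoded into a list of (r, g, b) tuples with 8-bit channel values; decoding the file is left out. The function names and the choice of 4 bins per channel are hypothetical, picked only to keep the example small.

```python
def relative_histogram(pixels, bins=4):
    """Per-channel histogram divided by the pixel count, so image size drops out."""
    if not pixels:
        raise ValueError("image has no pixels")
    hist = [[0.0] * bins for _ in range(3)]
    width = 256 // bins
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel][min(value // width, bins - 1)] += 1
    count = len(pixels)
    return [[bucket / count for bucket in channel] for channel in hist]


def histogram_delta(h1, h2):
    """Sum of absolute per-bin differences over all channels (0.0 means identical)."""
    return sum(abs(a - b) for c1, c2 in zip(h1, h2) for a, b in zip(c1, c2))


red, blue = (255, 0, 0), (0, 0, 255)
original = [red] * 6 + [blue] * 2   # 8 pixels
thumbnail = [red] * 3 + [blue] * 1  # 4 pixels, same colour proportions
different = [blue] * 4

same_score = histogram_delta(relative_histogram(original), relative_histogram(thumbnail))
diff_score = histogram_delta(relative_histogram(original), relative_histogram(different))
```

Here same_score comes out as zero because the thumbnail keeps the colour proportions, while diff_score is clearly larger. If each bin were stored as a numeric column next to the image, a comparison of this kind could in principle be expressed as an SQL query, which is the benefit mentioned above; what delta counts as "similar" is something you would have to tune on your own images.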
