ARFF parsing in ELKI 0.6.5 or 0.6.0

I would like to use the newest version of ELKI, but I get errors leading to NullPointerExceptions, and the task fails. When using 0.6.0 it works fine.
Here is some toy ARFF data:
@ATTRIBUTE 'var_0032' real
@ATTRIBUTE 'id' real
@ATTRIBUTE 'outlier' {'no','yes'}
@DATA
0.185185185185,1.0,'no'
0.0740740740741,2.0,'no'
But in 0.6.5 I get this failure:
Invalid quoted line in input: no closing quote found in: @ATTRIBUTE 'outlier' {'no','yes'}
Task failed
java.lang.NullPointerException
at de.lmu.ifi.dbs.elki.visualization.VisualizerContext.processNewResult(VisualizerContext.java:300)
at de.lmu.ifi.dbs.elki.visualization.VisualizerContext.<init>(VisualizerContext.java:141)
at de.lmu.ifi.dbs.elki.visualization.VisualizerParameterizer.newContext(VisualizerParameterizer.java:193)
at de.lmu.ifi.dbs.elki.visualization.gui.ResultVisualizer.processNewResult(ResultVisualizer.java:116)
at de.lmu.ifi.dbs.elki.workflow.OutputStep.runResultHandlers(OutputStep.java:70)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:120)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:60)
at [...]
In 0.6.0 this just seems to be a warning:
Invalid quoted line in input: no closing quote found in: @ATTRIBUTE 'outlier' {'no','yes'}
and it still produces the ROC curve.
Should I be worried?
Should I change my ARFF file, and how?

The ARFF file format (https://weka.wikispaces.com/ARFF+%28developer+version%29) doesn't use quotes there.
@RELATION example
@ATTRIBUTE var_0032 NUMERIC
@ATTRIBUTE id NUMERIC
@ATTRIBUTE outlier {no,yes}
@DATA
0.185185185185,1.0,no
0.0740740740741,2.0,no
Also, if your id column is really an id, don't give it the real datatype (which is only an alias for numeric). It's not a numerical column, and if you aren't careful it may be misused in the analysis.
So maybe better use something like this:
@RELATION example
@ATTRIBUTE var_0032 NUMERIC
@ATTRIBUTE id STRING
@ATTRIBUTE class {no,yes}
@DATA
0.185185185185,'1',no
0.0740740740741,'2',no
to get a proper ARFF file. I haven't tested it, though - does this work better?

First of all, definitely use 0.6.5. ELKI is not at a 1.0 release yet; there are bugs. They will not be fixed in old versions, only in the newest version, because we still need to be able to make larger API changes. Essentially, there should be no reason to use anything but the latest version. ELKI 0.7 will appear at the end of August at VLDB 2015.
ARFF is not used a lot. There may be errors in the parser, and ARFF support for categorical data is very limited right now. The strengths of the ARFF format show when you have lots of categorical attributes, but that is mostly used in classification - and ELKI doesn't include many classification algorithms yet (since Weka is already a strong tool for that, we focus on algorithms that are not available or not good in Weka).
Batik errors like this are usually due to NaN or infinite values. There are still some errors in the visualization code because SVG doesn't give good type safety, unfortunately. You can easily build SVG documents that are invalid, or that contain invalid characters such as ∞ in some coordinate, and then the Batik renderer will fail with such an error message.
What exactly are you trying to do? It looks a bit as if you are trying to compute the ROC curve for the existing output of an algorithm. I don't think there is an easy way to read an ARFF file containing (score, id, label) rows and compute a ROC curve using the MiniGUI. It's not hard to do in Java code, but it's not a use case of the KDD process workflow of the UI.
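For illustration, here is roughly what that Java code could look like. This is a minimal, library-free sketch under the assumption that you have already parsed the (score, label) pairs from the file yourself; the class and method names are made up and this is not ELKI API:

import java.util.Arrays;

public class RocAucSketch {
    // ROC AUC = probability that a random outlier is ranked above a random inlier.
    // Assumes both classes are present; ties in score are not given the usual
    // 0.5 credit here, for brevity.
    public static double auc(double[] scores, boolean[] isOutlier) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort indexes by score, descending: strongest outlier scores first.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        long positives = 0, negatives = 0;
        for (boolean o : isOutlier) { if (o) positives++; else negatives++; }
        long correctlyOrderedPairs = 0, positivesSeen = 0;
        for (int i : order) {
            if (isOutlier[i]) positivesSeen++;
            else correctlyOrderedPairs += positivesSeen; // outliers ranked above this inlier
        }
        return (double) correctlyOrderedPairs / (positives * negatives);
    }
}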

Related

Freemarker: How to write a BigDecimal's value that can be used in a BigDecimal constructor

I would like to use freemarker to generate java code that instantiates a BigDecimal.
The value of the BigDecimal is present at generation time.
The BigDecimal API would work like this:
BigDecimal copy = new BigDecimal(original.toString());
Alas, the freemarker magic applies numeric conversion to my value of original, so this does not work (in a freemarker template):
BigDecimal copy = new BigDecimal("${original?c}");
None of the numeric conversions (percent, number, computer, ...) work - c/computer most interestingly, because it outputs 0 if the value becomes too big.
With considerable pain, I might be able to wrap the BigDecimal in another object that just provides a toString and is not a number, so that freemarker leaves its value untouched and my generated BigDecimal is correct, but that's only a last resort.
Maybe there is a way to invoke the toString() method and print the result into the target document?
ERRATUM:
"because it outputs 0 if the value becomes too big" should read "because it outputs 0 if the value becomes too small" (see @ddekany's answer)
Update: FreeMarker 2.3.32 now supports lossless formatting with ?c, which is not based on DecimalFormat, but mostly on toString. To use that, you have to set the c_format configuration setting to any value other than its backward-compatible default, "legacy". Setting it to "JavaScript or JSON" is fine for most projects (that's also the default if you just set the incompatible_improvements configuration setting to 2.3.32). See the fine details of how numbers are formatted in the documentation of the c built-in: https://freemarker.apache.org/docs/ref_builtins_number.html#ref_builtin_c
Old answer, for 2.3.31 and before:
What FreeMarker does by default is whatever Java's DecimalFormat does (for the localized medium format by default, again defined by the Java platform).
?c uses new DecimalFormat("0.################") with a fixed US locale (and some more symbol adjustments for INF and NaN). So I don't know how that gives 0 for a huge BigDecimal number. Are you sure about that? I guess it was actually a very small number, so it was rounded down. Well, switching to "scientific" format would make more sense then, though.
To have whatever formatting logic you need, you can register your own number formatter: in this case, configuration.setCustomNumberFormats(Map.of("toString", new MyToStringTemplateNumberFormatFactory())), and then you can use ${original?string.@toString}. Or rather, you can set the number_format setting to @toString, and then just use ${original}.
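A minimal sketch of such a factory, assuming FreeMarker 2.3.24 or later (where custom number formats were introduced); the class name is made up to match the example above:

import java.util.Collections;
import java.util.Locale;

import freemarker.core.Environment;
import freemarker.core.TemplateNumberFormat;
import freemarker.core.TemplateNumberFormatFactory;
import freemarker.template.Configuration;
import freemarker.template.TemplateModelException;
import freemarker.template.TemplateNumberModel;

public class MyToStringTemplateNumberFormatFactory extends TemplateNumberFormatFactory {
    @Override
    public TemplateNumberFormat get(String params, Locale locale, Environment env) {
        // The locale is deliberately ignored: toString() is locale-independent.
        return new TemplateNumberFormat() {
            @Override
            public String formatToPlainText(TemplateNumberModel model) throws TemplateModelException {
                // For a BigDecimal this is the lossless representation the question asks for.
                return model.getAsNumber().toString();
            }

            @Override
            public boolean isLocaleBound() {
                return false;
            }

            @Override
            public String getDescription() {
                return "toString";
            }
        };
    }
}

Registering it and making it the default number format:

Configuration configuration = new Configuration(Configuration.VERSION_2_3_24);
configuration.setCustomNumberFormats(
        Collections.singletonMap("toString", new MyToStringTemplateNumberFormatFactory()));
configuration.setNumberFormat("@toString"); // now plain ${original} uses it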

Gensim's FastText KeyedVector out of vocab

I want to use the read-only version of Gensim's FastText Embedding to save some RAM compared to the full model.
After loading the KeyedVectors version, I get the following error when fetching a vector:
IndexError: index 878080 is out of bounds for axis 0 with size 761210
The error occurs when using words that should be out-of-vocabulary e.g. "lawyerxy" instead of "lawyer". The full model returns a vector for both.
from gensim.models import KeyedVectors
model = KeyedVectors.load("model.kv")
model.wv.__getitem__("lawyerxy")
So, my assumption is that the KeyedVectors do not offer FastText's out-of-vocabulary functionality - a key feature for my use case. This limitation is not mentioned in the documentation:
https://radimrehurek.com/gensim/models/word2vec.html
Can anyone confirm that assumption and/or name a fix to allow vectors for "lawyerxy" etc.?
The KeyedVectors name is (as of gensim-3.8.0) just an alias for class Word2VecKeyedVectors, which only maintains a simple word (as key) to vector (as value) mapping.
You shouldn't expect FastText's advanced ability to synthesize vectors for out-of-vocabulary words to appear in any model/representation that doesn't explicitly claim to offer that ability.
(I would expect a lookup of an out-of-vocabulary word to give a clearer KeyError rather than the IndexError you've reported. But, you'd need to show exactly what code created the file you're loading, and triggered the error, and the full error stack, to further guess what's going wrong in your case.)
Depending on how your model.kv file was saved, you might be able to load it, with retained OOV-vector functionality, by using the class FastTextKeyedVectors instead of plain KeyedVectors.
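A quick sketch of that, assuming gensim 3.8 (the import path differs in 4.x) and that model.kv really was saved from a FastText model, so the character n-gram matrices were stored along with it:

from gensim.models.keyedvectors import FastTextKeyedVectors

# Load the saved vectors with the FastText-aware class instead of plain KeyedVectors.
kv = FastTextKeyedVectors.load("model.kv")

# If the n-gram buckets were saved, a lookup of an out-of-vocabulary word is
# synthesized from subwords instead of raising an IndexError/KeyError.
print(kv["lawyerxy"][:5])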

Primitive type as data structure for API Blueprint

I want to use a primitive type to describe a data structure, like so:
# Data Structures
## Video Delete (enum[number])
+ `0` - Successful deletion.
+ `1` - Error occurred.
And the output is:
{
    "enum": [
        1,
        0
    ],
    "$schema": "http://json-schema.org/draft-04/schema#"
}
So the description is missing. I've tried to put the description in different places. I did a lot of things (I don't want to talk about them). I've also tried to add info to the enum values like so:
+ `0` (number) - Successful deletion.
I do not know whether this problem is with the MSON syntax or with the Aglio generator.
The syntax above is supported by MSON as far as I can tell. The problem is that Aglio doesn't do anything with the description, and when I went to look into adding it I realized that it isn't really supported in JSON Schema. There seem to be two methods people use to get around that fact:
1. Add the enumerated value descriptions to the main description. The Olio theme 1.6.2 has support for this, but the C++ parser seems to still have some bugs around this feature:
## Video Delete (enum[number]) - 0 for success, 1 for error
2. Use a weird oneOf syntax where you create sets of single enums, each with a description (sketched below). I don't recommend this.
Unfortunately the first option requires work on your part and can't easily be done in Aglio. Does anyone else have a better description and some samples of MSON input -> JSON Schema output?
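For reference, the oneOf workaround would look roughly like this as hand-written JSON Schema (my sketch, not actual Aglio output; the values and descriptions follow the example above):

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "oneOf": [
        { "enum": [0], "description": "Successful deletion." },
        { "enum": [1], "description": "Error occurred." }
    ]
}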

Specifying the form of names for categories generated by patsy/statsmodels 'C'

By default, Patsy's C seems to generate categories with names of the form
C(color, Treatment('White'))[T.Green]
at least when used in a formula provided to statsmodels OLS. Is there a way to tell C to generate less verbose category names, e.g., of the form
colorGreen
or even simply
Green
There's an open issue for this; please discuss alternatives there:
https://github.com/pydata/patsy/issues/19
A bit late to the party, but for those searching for this in 2021:
If you're prepared to do a bit of wrangling, you can take apart the statsmodels Summary object (returned when calling summary() on a fitted model), convert it to a DataFrame, and format it from there.
The Summary object has a tables attribute. The first table is the result of the fit, the second is the coefficients table. The tables have an as_html() method whose output you can pass to the pandas read_html() method.
df = pd.read_html(your_fitted_model.summary().tables[1].as_html(), header=0)[0]
From there you can strip out the patsy formatting via regular string and dataframe methods.
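For example, a sketch of that last step (the regex is a guess based on the naming scheme shown above; adjust it to your formula):

import re
import pandas as pd

# Coefficient names end up in the first column; make them the index.
df = pd.read_html(your_fitted_model.summary().tables[1].as_html(),
                  header=0, index_col=0)[0]

# "C(color, Treatment('White'))[T.Green]" -> "Green"; names that don't match
# the pattern (e.g. "Intercept") are left unchanged.
df.index = [re.sub(r"^C\(.*\)\[T\.(.*)\]$", r"\1", name) for name in df.index]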

Convert to E164 only if possible?

Can I determine if the user entered a phone number that can be safely formatted into E164?
For Germany, this requires that the user started his entry with a local area code. For example, 123456 may be a subscriber number in his city, but it cannot be formatted into E164, because we don't know his local area code. In that case I would like to keep the entry as it is. In contrast, the input 089123456 is independent of his location and could be formatted into E164: since we know he's from Germany, we could convert it into +4989123456.
You can simply convert your number to E164 using libphonenumber, and after the conversion check whether the two strings are the same. If they are the same, the number could not be formatted; otherwise the number you get from the library is formatted in E164.
Here's how you can convert:
PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
String formattedNumber = phoneUtil.format(inputNumber, PhoneNumberFormat.E164);
Finally, compare formattedNumber with inputNumber.
It looks as though you'll need to play with isValidNumber and isPossibleNumber for your case. format is certainly not guaranteed to give you something actually dialable; see the javadocs. This is suggested by the demo as well, where formatting is not displayed when isValidNumber is false.
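Putting the two answers together, a sketch of "convert only if valid" (the class and method names are made up; this assumes the Java version of libphonenumber):

import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

public class E164Helper {
    // Return the E164 form only when libphonenumber considers the number valid;
    // otherwise keep the user's entry unchanged.
    public static String toE164IfPossible(String input, String region) {
        PhoneNumberUtil util = PhoneNumberUtil.getInstance();
        try {
            PhoneNumber parsed = util.parse(input, region); // e.g. region = "DE"
            if (util.isValidNumber(parsed)) {
                // "089123456" parsed with region "DE" formats to "+4989123456"
                return util.format(parsed, PhoneNumberFormat.E164);
            }
        } catch (NumberParseException e) {
            // Not parseable at all; fall through and keep the input.
        }
        return input; // e.g. a bare subscriber number like "123456"
    }
}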
FWIW, I am also dealing with this, in the context of US numbers: the issue is that I'd like to parse using isPossibleNumber in order to be as lenient as possible, and store the number in E164. But then we accept, e.g., +15551212. This string even passes isPossibleNumber despite clearly (I think) not being dialable anywhere.
