CDF to normalize data - methods

I really need your help. I want to scale my data to [0, 1] before clustering it. Does it make sense to use the cumulative distribution function (CDF) to normalize the data in advance? (My features have different value ranges.) Please explain your reasoning. Is there anything comparable in the literature? I didn't really find anything. I am very grateful for any help!
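For what it's worth, mapping each feature through its own empirical CDF is a known idea (rank/quantile transformation, related to the probability integral transform), and scikit-learn's QuantileTransformer implements it. Here is a minimal sketch, assuming a NumPy feature matrix; the example data is made up:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Made-up features with very different value ranges
rng = np.random.default_rng(0)
X = rng.exponential(scale=[1.0, 50.0, 1000.0], size=(500, 3))

# Map each feature through its empirical CDF -> values in [0, 1]
qt = QuantileTransformer(output_distribution="uniform", n_quantiles=500)
X_cdf = qt.fit_transform(X)

print(X_cdf.min(axis=0), X_cdf.max(axis=0))  # approximately 0 and 1 per feature
```

Note that this rank-based scaling discards the features' original scale and distribution shape, which may or may not be desirable before clustering.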

Related

Statistics to validate model with independent data set

I am working on modeling the forest understory using the RandomForest classifier. The results are probability values of understory tree occurrence. I also have an independent dataset that was not used in model building, and I want to test how reliable the prediction model is against this field data.
I would like to know which statistics I should use to do this. I was thinking of a t-test, but I doubt that is the right statistic. I wonder if I can use kappa or other agreement statistics, but I am not so sure about it. I hope someone can help me with this. Thank you.
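To make the kappa option the asker mentions concrete (without claiming it is the right choice here), a minimal sketch: it assumes binary field observations and thresholds the occurrence probabilities at an arbitrary assumed cut-off of 0.5 before computing Cohen's kappa with scikit-learn. The data below is invented:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Assumed example data (not from the original post):
# field_obs -- independent field observations, 1 = understory present
# pred_prob -- model probabilities of understory occurrence
field_obs = np.array([1, 0, 1, 1, 0, 0, 1, 0])
pred_prob = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.3, 0.7])

# Threshold the probabilities (0.5 is an arbitrary assumed cut-off)
pred_class = (pred_prob >= 0.5).astype(int)

kappa = cohen_kappa_score(field_obs, pred_class)
print(f"Cohen's kappa: {kappa:.2f}")
```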

How do I classify when I have only positive examples in machine learning?

My question is: can we classify something when we have data for only the positive case?
I know this is confusing, so let me put it this way:
Normally, we train a classifier with A and B, where A is the positive data set and B is the negative data set. But here I only have a positive data set for my action/use case and no way to obtain a negative data set.
Can I use machine learning to classify whether incoming data is positive or not?
If yes, which classifier can be used to get the job done?
A Ruby-based solution is preferred.
Thanks.
Maybe one-class learning using Support Vector Machines will help you. But this type of ML algorithm has a lot of limitations in practice.
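To illustrate the suggestion: a minimal one-class SVM sketch using scikit-learn (Python, since a Ruby equivalent is not shown in the original; the training data and the nu/gamma settings are assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Assumed positive-only training data (invented for illustration)
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu bounds the fraction of training points treated as outliers (assumed value)
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
clf.fit(X_pos)  # trained on positives only -- no negative set needed

X_new = np.array([[0.1, -0.2], [8.0, 8.0]])
print(clf.predict(X_new))  # +1 = looks like the positive class, -1 = does not
```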

Identify changes in the slope using machine learning

I want to get my hands dirty with some machine learning, and I finally have a problem which seems like a good beginner project. However, despite reading a lot about the subject I am unsure how to get started, and what my basic approach should be.
I have a dataset which should look like this (first image); a real dataset looks more like this (second image).
I want to identify the points in the red circles (in the first image) and be robust against occasional artifacts like the one in the blue circle.
It sounds like a really easy task. However, there is quite a lot of noise in the raw data. My current implementation is pretty traditional: it blurs the data and compares the first and second derivatives to some estimated threshold values (a sketch of this baseline follows at the end of this question). This approach works, but can "only" identify the points with ~99.7% accuracy, and since I do around 100,000 measurements a day I would love to increase this number.
So, this is what I have:
All the datasets I want/need
A pretty good model of how the data should look.
A pretty good training set, generated by my existing algorithm (the outliers can be fixed manually)
However, I do not have a basic idea of what approach I should use. It feels like none of the material I've read on machine learning fits this problem.
Can someone help me with the super high level approach to solve this problem?
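For reference, here is a minimal sketch of the traditional baseline described above (blur the signal, then compare derivatives to estimated thresholds). The smoothing window, the threshold, and the toy signal are all assumed values, not the asker's:

```python
import numpy as np

def find_slope_changes(y, window=5, d2_thresh=0.05):
    """Sketch of the described baseline: blur the data, then flag points
    where the second derivative (change in slope) exceeds a threshold.
    The window size and threshold are assumptions for illustration."""
    kernel = np.ones(window) / window
    y_smooth = np.convolve(y, kernel, mode="same")  # simple box blur
    d1 = np.gradient(y_smooth)                      # slope
    d2 = np.gradient(d1)                            # change in slope
    return np.flatnonzero(np.abs(d2) > d2_thresh)   # candidate change points

# Toy signal: flat, then a ramp, then flat again -> two slope changes
y = np.concatenate([np.zeros(50), np.linspace(0.0, 10.0, 20), np.full(50, 10.0)])
print(find_slope_changes(y))  # indices clustered around the two corners
```

A natural ML reframing of the same problem is sliding-window classification, using the existing algorithm's output (with manual corrections) as training labels, but the sketch above only covers the baseline the question describes.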

Point Intersection With Polygon in Ruby

How can I quickly find which of a set of polygons contain a given point?
I have a collection of polygons in a PostGIS database. I'm using RGeo on the Ruby side to manipulate, save, and pull information to and from the database.
I receive a point (x and y coordinates) from an external machine and need to know which of my polygons this point lies within. I can't use the database because I need this to be done in memory for performance reasons.
I believe I might need an r-tree, but I don't exactly want to write one.
RGeo provides a contains? method that I can use to ensure a point is within a polygon of interest, but I need to know which polygon to check. I have on the order of 1,000 polygons and doing a linear search is not time efficient enough for my needs.
Can this help? Otherwise, there is this.
It seems that "neartree" is a better term to search for with respect to Ruby.
Hope this helps!
EDIT: If you need a general-purpose implementation of an R-tree, maybe the Boost (C++) library can help; there are bindings for it here.
It has bindings for methods which should help your use case:
intersects?
intersects_each?
intersects_rect?
intersects_rect_each?
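The suggested approach is the usual two-step pattern: use an R-tree over bounding boxes to prune to a few candidates, then run the exact containment test (RGeo's contains?) on just those. Since the Ruby gem's exact API isn't shown above, here is the same pattern as a Python sketch using shapely's STRtree (an assumption for illustration, not part of the asker's stack; assumes shapely 2.x, where query returns indices):

```python
from shapely.geometry import Point, Polygon
from shapely.strtree import STRtree

# Two toy polygons standing in for the ~1,000 real ones
polygons = [
    Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]),
    Polygon([(5, 5), (9, 5), (9, 9), (5, 9)]),
]

tree = STRtree(polygons)  # builds an R-tree over the polygons' bounding boxes

point = Point(2, 3)
# Cheap candidate filtering: indices of polygons whose bounding boxes
# intersect the point (shapely 2.x behavior)
candidates = tree.query(point)
# Exact test on the few candidates, analogous to RGeo's contains?
hits = [i for i in candidates if polygons[i].contains(point)]
print(hits)  # -> [0]
```

With on the order of 1,000 polygons, the index reduces each lookup from a linear scan to a handful of exact tests.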

Converting spectral data for given Observer/Illuminant to another Observer/Illuminant

I'm working on simple measuring software for HunterLab (color) instruments (EZ line) (screenshot here), and I hope someone can help out.
They deliver spectral data from 400 nm to 700 nm in 10 nm steps, using a D65 light source and the 10° observer.
I have the observer functions for ASTM D65, which work great: I can reproduce any value from the instrument 1:1, as long as I measure in D65/10° (converting to XYZ and then to CIELab using tristimulus references for the perfect reflecting diffuser).
That was done mostly using algorithms from brucelindbloom.com and easyrgb.com; both have some great information!
Now I want to add the ability to convert the spectral data to another observer or another illuminant (or both), but I can't wrap my head around how to do that.
I guess some directions would be enough, but I don't know whether I would need even more reference data (references for illuminants by wavelength?) or whether it's done by some other means.
OK, here is the answer :)
Spectral data from most spectrophotometers is already corrected, so the hardware illuminant and angle don't matter.
What you do is apply the observer/weighting functions for each desired illuminant/observer combination, as given in ASTM E308, to convert the spectral data to XYZ, instead of only using the table that corresponds to the hardware illuminant/angle.
That's a lot of reference values, but it works perfectly.
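For orientation, here is a minimal NumPy sketch of the tristimulus integration that the ASTM E308 tables tabulate. The flat arrays below are placeholders standing in for published CIE tables (target illuminant SPD and color-matching functions sampled at the instrument's 400–700 nm / 10 nm grid); in real use you would substitute the actual tables:

```python
import numpy as np

wavelengths = np.arange(400, 701, 10)           # 31 samples, matching the instrument
reflectance = np.full(wavelengths.shape, 0.5)   # measured spectrum (placeholder)
spd  = np.ones(wavelengths.shape)               # target illuminant SPD, e.g. D50 table
xbar = np.ones(wavelengths.shape)               # target observer CMFs, e.g. 2 deg tables
ybar = np.ones(wavelengths.shape)
zbar = np.ones(wavelengths.shape)

def spectrum_to_xyz(reflectance, spd, xbar, ybar, zbar):
    # k normalizes so the perfect reflecting diffuser (R = 1 everywhere) gets Y = 100
    k = 100.0 / np.sum(spd * ybar)
    X = k * np.sum(reflectance * spd * xbar)
    Y = k * np.sum(reflectance * spd * ybar)
    Z = k * np.sum(reflectance * spd * zbar)
    return X, Y, Z

print(spectrum_to_xyz(reflectance, spd, xbar, ybar, zbar))  # -> (50.0, 50.0, 50.0)
```

Swapping the spd/xbar/ybar/zbar tables is all it takes to target a different illuminant or observer, which is exactly why the answer above says you need a lot of reference values.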
