How to detect which data are affecting the result of a feature with machine learning?

First, let me illustrate the scenario. I have a dataset with fields like:
ProductID, ProductType, MachineID, MachineModel, MachineSpeed, RejectDate, RejectVolume etc.
I want to find which field(s) are responsible for the increase in my RejectVolume. In this scenario, every product has a RejectVolume; the values are nonzero and continuous, but differ from product to product. With that knowledge I could recognize the cause(s) and find a way to reduce RejectVolume.
Can you give me any ideas for creating the model?
Thank you.

You want to look at Feature Selection methods.
In this scenario you could start with linear regression using the Lasso for feature selection. As you increase the Lasso regularization term, the coefficients of unimportant features are driven toward zero, leaving you with the features that have the most impact.
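A minimal sketch of that idea, assuming the data sit in a pandas DataFrame with the columns from the question and that scikit-learn is available; the file name, column split and preprocessing are illustrative assumptions, not a prescription:

    # Lasso-based feature selection sketch; column names come from the question,
    # the file name "production.csv" is hypothetical
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LassoCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("production.csv")
    X = df.drop(columns=["RejectVolume", "ProductID", "RejectDate"])
    y = df["RejectVolume"]

    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["ProductType", "MachineID", "MachineModel"]),
        ("num", StandardScaler(), ["MachineSpeed"]),
    ])

    # LassoCV sweeps a range of regularization strengths; stronger penalties
    # drive the coefficients of uninformative features to exactly zero
    model = make_pipeline(pre, LassoCV(cv=5))
    model.fit(X, y)

    names = model.named_steps["columntransformer"].get_feature_names_out()
    coefs = model.named_steps["lassocv"].coef_
    for name, coef in sorted(zip(names, coefs), key=lambda t: -abs(t[1])):
        if coef != 0.0:
            print(f"{name}: {coef:.4f}")

The features that keep nonzero coefficients are the ones with the most influence on RejectVolume under this (linear) model.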

Related

How to deal with infeasible individuals in a Genetic Algorithm?

I'm trying to optimize a thermal power plant in a thermoeconomic way using genetic algorithms. Creating the population gives me a lot of infeasible individuals (e.g. ValueError, TypeError, etc.). I tried using penalty functions, but the GA gets stuck in the first populations that contain a feasible individual and doesn't evolve. Is there any other way to deal with this?
I will be grateful if anyone can help me.
Thanks in advance.
Do not allow such individuals to become part of the population. It will slow down your convergence, but it guarantees that the solutions found are valid.
You may want to look into Diversity Control.
In theory, invalid individuals may contain advantageous/valid pieces of code, and discarding them just because they have a bug is wasteful. In diversity control, the population is grouped into species based on a similarity metric (for tree structures it is usually edit distance), and the fitness of each individual is "shared" with the other members of its group, i.e. fitness = performance / group_size. This is usually done to prevent premature convergence and to widen the exploration.
By combining your penalty function with diversity control, if the group of valid individuals becomes too numerous, fitness within that group goes down, and groups that throw errors yet are less numerous become more competitive, carrying their potentially valuable material forward.
Finally, something like rank-based selection should make the search insensitive to outliers, so even when your top individual is 200% better than the others, it won't be selected all the time.
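A rough sketch of combining a penalty for error-raising individuals with fitness sharing and rank-based selection; evaluate() is a stand-in for your thermoeconomic objective, and the radius, penalty value and rank weights are illustrative assumptions:

    import math
    import random

    GROUP_RADIUS = 0.5   # individuals closer than this count as one "species" (assumed value)
    PENALTY = 1e-3       # small base fitness for individuals that raise errors (assumed value)

    def raw_performance(individual):
        try:
            return evaluate(individual)   # hypothetical objective; may raise on infeasible inputs
        except (ValueError, TypeError):
            return PENALTY                # penalised, but kept in the population

    def distance(a, b):
        # Euclidean distance for real-valued individuals
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def shared_fitness(population):
        perf = [raw_performance(ind) for ind in population]
        shared = []
        for i, ind in enumerate(population):
            # fitness = performance / group_size, where the group is every
            # individual within GROUP_RADIUS of this one
            group_size = sum(1 for other in population
                             if distance(ind, other) < GROUP_RADIUS)
            shared.append(perf[i] / group_size)
        return shared

    def rank_based_selection(population, fitness, k):
        # selection pressure depends only on rank, so a single individual that is
        # 200% better than the rest does not dominate the mating pool
        ranked = [ind for ind, _ in sorted(zip(population, fitness), key=lambda t: t[1])]
        weights = list(range(1, len(ranked) + 1))   # worst rank gets weight 1
        return random.choices(ranked, weights=weights, k=k)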

How to validate cluster analysis for high-dimensional data (gene expression)

Hello, I'm new here; I hope I have entered everything correctly and that this question is in the right forum. I have also checked, and no previous question seems comparable to this one.
To my question:
I am currently working on the validation of clustering methods using the package clValid. The dataset I work with is very large (1,000 × 25,000); it is gene expression data. My question is which methods for validating high-dimensional datasets are applicable at all. Maybe there is another package for validating clustering in high-dimensional space. Do I have to do a PCA first? How big can my dataset be for clValid to still be usable on it? (I don't want to let my computer run for hours; or should I just let it run and wait for a result, perhaps on a smaller dataset, say 100 × 500?) I am grateful for every suggestion; maybe there are solutions I haven't thought about yet.
I would rather not rely on any of these indexes.
These measures usually require clusters to be complete and disjoint, and that does not hold for typical gene biclusters: there are genes not involved in any of the effects observed in the experiment.
The measures were usually designed with low-dimensional Gaussian data in mind, and once you have such high-dimensional data, where all distances are large, they report that there is no contrast between the clusters (because the measure does not see contrast between any two data points).
I fear that you may need to evaluate by complex, domain-specific analysis.
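A small toy demonstration of that distance-concentration effect (random Gaussian data rather than gene expression, purely for illustration):

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for dim in (2, 10, 100, 1000, 10000):
        X = rng.normal(size=(200, dim))
        d = pdist(X)   # all pairwise Euclidean distances
        # as dimensionality grows, the ratio between the farthest and the nearest
        # pair shrinks toward 1, so the "contrast" that validity indexes rely on disappears
        print(f"dim={dim:>5}  max/min distance ratio = {d.max() / d.min():.2f}")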

How to judge performance of algorithms for Text Clustering?

I am using K-Means algorithm for Text Clustering with initial seeding with K-Means++.
I am trying to make the algorithm more effective with some changes, like changing the stop-word dictionary and increasing max_no_of_random_iterations.
I get different results. How do I compare them? I could not apply the idea of a confusion matrix here: the output is not in the form of a document getting some value or tag; a document goes to a set. It is only the relative "goodness" of the clustering, or the resulting sets, that matter.
So is there some standard way of scoring the performance of this kind of output?
If a confusion matrix is the answer, please explain how to apply it.
Thanks.
You could decide in advance how to measure the quality of the clusters, for example by counting how many are empty, or with statistics like the within-cluster sum of squares (see the sketch at the end of this answer).
This paper says
"... three distinctive approaches to cluster validity are possible.
The first approach relies on external criteria that investigate the
existence of some predefined structure in clustered data set. The
second approach makes use of internal criteria and the clustering
results are evaluated by quantities describing the data set such as
proximity matrix etc. Approaches based on internal and external
criteria make use of statistical tests and their disadvantage is
high computational cost. The third approach makes use of relative
criteria and relies on finding the best clustering scheme that meets
certain assumptions and requires predefined input parameters values"
Since clustering is unsupervised, you are asking for something difficult. I suggest researching how people cluster using genetic algorithms and see what fitness criteria they use.
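To make the internal-criteria option concrete, here is a minimal sketch that compares two K-Means runs using inertia (the within-cluster sum of squares) and the silhouette score; the toy corpus and parameters are placeholders, and comparisons across different vectorizations (e.g. different stop-word lists) are only rough:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import silhouette_score

    docs = [  # tiny placeholder corpus; substitute your own documents
        "the machine produced many rejects today",
        "reject volume increased on machine two",
        "machine speed affects the reject rate",
        "the football match ended in a draw",
        "our team won the championship game",
        "the goalkeeper saved a late penalty",
    ]

    def cluster_and_score(docs, stop_words=None, n_clusters=2, max_iter=300):
        X = TfidfVectorizer(stop_words=stop_words).fit_transform(docs)
        km = KMeans(n_clusters=n_clusters, init="k-means++",
                    max_iter=max_iter, n_init=10, random_state=0).fit(X)
        # lower inertia and higher silhouette are better; only compare runs
        # that use the same number of clusters
        return km.inertia_, silhouette_score(X, km.labels_)

    print(cluster_and_score(docs, stop_words="english"))
    print(cluster_and_score(docs, stop_words=None, max_iter=600))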

Scalable real time item based mahout recommender with precomputed item similarities using item similarity hadoop job?

I have the following setup:
boolean data: (userid, itemid)
hadoop based mahout itemSimilarityJob with following arguements:
--similarityClassname Similarity_Loglikelihood
--maxSimilaritiesPerItem 50 & others (input,output..)
item based boolean recommender:
-model MySqlBooleanPrefJDBCDataModel
-similarity MySQLJDBCInMemoryItemSimilarity
-candidatestrategy AllSimilarItemsCandidateItemsStrategy
-mostSimilarItemsCandidateStrategy AllSimilarItemsCandidateItemsStrategy
Is there a way to use the co-occurrence similarity in my setup to get final recommendations? If I plug SIMILARITY_COOCCURENCE into the job, the MySqlJDBCInMemorySimilarity precondition checks fail, since the counts become greater than 1. I know I can get final recommendations by running the recommender job on the precomputed similarities. Is there a way to do this in real time using the API, as in the case of the log-likelihood similarity (and other similarity metrics with values between -1 and 1), using MysqlInMemorySimilarity?
How can we cap the maximum number of similar items per item in the item similarity job? What I mean is that AllSimilarItemsCandidateItemsStrategy calls .allSimilarItems(item) to get all possible candidates. Is there a way I can get, say, the top 10/20/50 similar items using the API? I know we can pass --maxSimilaritiesPerItem to the item similarity job, but I am not completely sure what it stands for and how it works. If I set it to 10/20/50, will I be able to achieve what is stated above? Also, is there a way to accomplish this via the API?
I am using a rescorer for filtering out and rescoring final recommendations. With the rescorer, the calls to /recommend/userid?howMany=10&rescore={..} and to /similar/itemid?howMany=10&rescore={..} take much longer (300-400 ms) compared to (30-70 ms) without the rescorer. I'm using Redis as an in-memory store to fetch the rescore data. The rescorer also receives some run-time data, as shown above. There are only a few checks that happen in the rescorer. The problem is that as the number of item preferences for a particular user grows (> 100), the number of calls to isFiltered() and rescore() increases massively. This is mainly because, for every user preference, the call to candidateStrategy.getCandidateItems(item) returns around 100+ similar items each, and the rescorer is called for each of these items. Hence the need to cap the maximum number of similar items per item in the job. Is this correct, or am I missing something here? What's the best way to optimise the rescorer in this case?
MysqlJdbcInMemorySimilarity uses GenericItemSimilarity to load item similarities into memory, and its .allSimilarItems(item) returns all possible similar items for a given item from the precomputed item similarities in MySQL. Do I need to implement my own item similarity class to return the top 10/20/50 similar items? And what if a user's number of preferences continues to grow?
It would be really great if anyone could tell me how to achieve the above. Thanks heaps!
What Preconditions check are you referring to? I don't see them; I'm not sure if similarity is actually prohibited from being > 1. But you seem to be asking whether you can make a similarity function that just returns co-occurrence, as an ItemSimilarity that is not used with Hadoop. Yes you can; it does not exist in the project. I would not advise this; LogLikelihoodSimilarity is going to be much smarter.
You need a different CandidateItemsStrategy; in particular, look at SamplingCandidateItemsStrategy and its javadoc. But this is not a Hadoop setting, it is a run-time element, whereas you mention a flag to the Hadoop job. That is not the same thing.
If rescoring is slow, it means, well, the IDRescorer is slow. It is called so many times that you certainly need to cache any lookup data in memory. But reducing the number of candidates, as described above, will also reduce the number of times it is called.
No, don't implement your own similarity. Your issue is not the similarity measure but how many items are considered as candidates.
I am the author of much of the code you are talking about. I think you are wrestling with exactly the kinds of issues most people run into when trying to make item-based work at significant scale. You can, with enough sampling and tuning.
However, I am putting new development into a different project and company called Myrrix, which is developing a sort of 'next-gen' recommender based on the same APIs, but which ought to scale without these complications, as it's based on matrix factorization. If you have time and interest, I strongly encourage you to have a look at Myrrix. Same APIs, the real-time Serving Layer is free/open, and the Hadoop-based Computation Layer is also available for testing.

Appropriate clustering method for 1 or 2 dimensional data

I have a set of data I have generated that consists of extracted mass values (well, m/z, but that is not so important) and a time. I extract the data from a file; however, it is possible to get repeat measurements, and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group those that are related, based either on similarity in mass alone or on similarity in mass and time.
An example of data that should be group together is:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine how many clusters I will have. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? Sadly, I am not familiar with clustering algorithms.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering, and also look into DBSCAN if you really don't want to specify the number of clusters in advance. You will need to define a distance metric, and it is in that step that you determine which feature, or combination of features, you will be clustering on.
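A minimal sketch with DBSCAN from scikit-learn on the example rows above; the per-column scaling and eps are illustrative assumptions you would tune (the 410.20 row is made up, just to show a second group):

    import numpy as np
    from sklearn.cluster import DBSCAN

    data = np.array([
        [337.65, 1524.6],
        [337.65, 1524.6],
        [337.65, 1604.3],
        [410.20, 1610.0],
    ])

    # divide each column by the tolerance you consider "the same", so that a
    # Euclidean distance of about 1 means "within tolerance in both mass and time"
    scaled = data / np.array([0.1, 100.0])

    labels = DBSCAN(eps=1.5, min_samples=1).fit_predict(scaled)
    print(labels)   # rows sharing a label belong to the same group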
Why don't you just set a threshold?
If successive values (ordered by time) differ by less than ±0.1 (in m/z), they are grouped together. Alternatively, use a relative threshold: differ by less than ±0.1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems like total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.
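A minimal sketch of that change-threshold grouping, assuming rows of (m/z, time) and an absolute tolerance of 0.1 m/z (an illustrative value; the 410.20 row is made up to show a second group):

    MZ_TOLERANCE = 0.1   # illustrative; set from your instrument's precision

    rows = [
        (337.65, 1524.6),
        (337.65, 1524.6),
        (337.65, 1604.3),
        (410.20, 1610.0),
    ]

    groups = []
    for mz, t in sorted(rows, key=lambda r: r[1]):       # walk successive values by time
        if groups and abs(mz - groups[-1][-1][0]) <= MZ_TOLERANCE:
            groups[-1].append((mz, t))                   # within tolerance of the previous row
        else:
            groups.append([(mz, t)])                     # start a new group

    for group in groups:
        print(group)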
For the simple one-dimensional case, K-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting an appropriate K. A good way to choose K is to plot K against the residual variance and pick the K at which the variance drops "dramatically" (the elbow). Another strategy is to use an information criterion (e.g. the Bayesian Information Criterion).
You can extend K-means to multi-dimensional data easily, but you should be aware of the scaling of the individual dimensions. E.g. among the items (1 kg, 1 km) and (2 kg, 2 km), the nearest point to (1.7 kg, 1.4 km) is (2 kg, 2 km) with these scales. But once you start expressing the second coordinate in meters, the opposite is probably true.
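A small sketch of the K-vs-residual-variance (elbow) idea on one-dimensional m/z values; the numbers are made up, and for two-dimensional (m/z, time) data you would scale the columns first, e.g. with sklearn.preprocessing.StandardScaler:

    import numpy as np
    from sklearn.cluster import KMeans

    # made-up m/z values forming three fairly obvious groups
    mz = np.array([337.65, 337.65, 337.66, 410.20, 410.21, 552.30, 552.31]).reshape(-1, 1)

    for k in range(1, 6):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(mz)
        # inertia_ is the within-cluster sum of squares; pick the K after which
        # it stops dropping sharply
        print(k, round(km.inertia_, 4))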
