How can we determine how many rules and fuzzy sets we need in our fuzzy system?
Does increasing the number of rules and fuzzy sets make the system better?
How can we determine how many rules and fuzzy sets we actually need for better results?
Thanks
There are many different methods of determining where you need to model fuzziness in a particular application. The overarching principles to keep in mind are: 1) Look for places where it would be beneficial to treat ordinal or nominal data on a continuous scale, even at the cost of imprecision, and 2) The "fuzz" should be naturally present in the data or problem you're trying to solve; it's not a secret ingredient one adds to make an application better, as is sometimes implied by overeager enthusiasts. Only add fuzzy rules and sets when you can justify the added computational, data collection or other costs in terms of greater accuracy or some other practical use.
With those principles in mind, here are some ways of detecting places where fuzzy rules and sets might be useful:
• The number one candidate is natural language modeling, perhaps through a Behavior-Driven Development (BDD) process if you're in a software development environment. For example, you can interview people with domain knowledge and look for naturally fuzzy statements, such as "cloudy," "overcast" and "sunny" in meteorology, or fuzzy numbers, like "about half" or "most." Then find membership functions that most accurately match the meaning assigned to those terms. Note that sometimes terms from multiple fuzzy sets can occur together; for example, you might grade the truth of the statement "about half of these days were cloudy," which might require three separate membership functions: one for truth, one for the fuzzy number and a third for the "cloudy" category. Linguistic analysis is the simplest way, since people naturally use fuzzy language every day; be aware, though, that multiple fuzzy sets can also be combined to model fuzzy logic curiosities that don't often occur in natural language, like “John is taller than he is clever,” “Inventory is higher than it is low,” “Coffee is at least as unhealthy as it is tasty,” and “Her last novel is more political than it is confessional.” Those examples come from p. 16, Bilgic, Taner and Turksen, I.B. August 1994, “Measurement–Theoretic Justification of Connectives in Fuzzy Set Theory,” pp. 289–308 in Fuzzy Sets and Systems, January 1995. Vol. 76, No. 3.
• Another important task is sorting out how to model "linguistic connectives" like fuzzy ANDs and ORs, or crisp conjunctions between fuzzy statements. Some guiding principles have been worked out and are available in such sources as Alsina, C.; Trillas E. and Valverde, L., 1983, “On Some Logical Connectives for Fuzzy Sets Theory," pp. 15-26 in Journal of Mathematical Analysis and Applications. Vol. 93; Dubois, Didier and Prade, Henri, 1985, “A Review of Fuzzy Set Aggregation Connectives," pp. 85-121 in Information Sciences, July-August, 1985. Vol. 36, Nos. 1-2.
• Pooling the opinions of experts (as in an expert system) or the subjective scores of others (as in a movie ratings system). The ratings themselves would constitute one level of fuzziness, while another tier could be added to weight the importance of each expert or other individual's particular score, if they're particularly authoritative.
• Another option is to use neural nets to determine whether or not the addition of various fuzzy rules and sets to your model actually improves accuracy or some other metric related to your end goal.
• Other options include estimating membership functions and the parameters of T-norms and T-conorms (which are used often in fuzzy complements, unions and intersections) with such techniques as regression, Maximum Likelihood Estimation (MLE), Lagrange interpolation, curve fitting and parameter estimation. All of these are discussed in my favorite reference for fuzzy set math, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.
• In the same vein, deciding whether or not to include particular fuzzy rules or sets may depend on whether or not you can find a good fit between the underlying and possibly unknown "actual" membership function and the ones you're testing. Most of the time triangular, trapezoidal or Gaussian functions will suffice, but in some situations distribution testing might be necessary to find just the right distribution function. Empirical Distribution Functions (EDFs) might come in handy here.
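To make the shapes mentioned above concrete, here is a minimal Python sketch of triangular, trapezoidal and Gaussian membership functions, plus two common choices each for fuzzy AND (T-norm) and fuzzy OR (T-conorm). The function names and the cloud-cover breakpoints are assumptions picked purely for illustration; they are not taken from any of the references cited above.

```python
import math

def triangular(x, a, b, c):
    """Triangular membership: 0 at a, peak of 1 at b, back to 0 at c (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: rises on a..b, equals 1 on b..c, falls on c..d."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

def gaussian(x, mean, sigma):
    """Gaussian membership centred on mean with spread sigma."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

# Two common T-norms (fuzzy AND) and their dual T-conorms (fuzzy OR).
def t_min(u, v):
    return min(u, v)

def t_product(u, v):
    return u * v

def s_max(u, v):
    return max(u, v)

def s_prob_sum(u, v):
    return u + v - u * v

# Hypothetical linguistic terms for cloud cover in percent (breakpoints are made up).
def cloudy(cover):
    return triangular(cover, 20, 55, 90)

def overcast(cover):
    return trapezoidal(cover, 70, 90, 100, 101)

cover = 80
cloudy_and_overcast = t_min(cloudy(cover), overcast(cover))
cloudy_or_overcast = s_max(cloudy(cover), overcast(cover))
```

Whether min/max or product/probabilistic sum is the better connective is exactly the kind of question the estimation techniques above are meant to answer from data.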
To make a long story short, a lot of different statistical and machine learning techniques can be applied to give ballpark answers to these questions. The key is to always stay within the bounds of the two main principles above and only model things with fuzzy sets when it would serve your practical goals, then leave out the rest. I hope that helps.
Related
Are there any internal validity indices/methods to evaluate the quality of my algorithm, which don't mostly depend on the proximity measure (e.g., distance matrix)?
All the conventional measures (such as silhouette, Dunn index, N-cut, DB index, etc.) depend on how well you defined a proximity over the data and on the final partition, rather than on the data itself.
There is no such thing as "depending on the data itself"; data is an abstract term which can describe a set of elephants or ring isomorphisms. In order to define any index you need to use one of two things:
In a supervised scenario (when you know the classes of some objects; you don't necessarily have to use them for training, but you have to know them) you can use these labels to calculate impurity, or any other classification-derived score.
In an unsupervised scenario you have to use some similarity measure, which can be very arbitrary. It might be an inverse of some metric, or it might be a completely abstract measure derived from asking some people "are these elements similar?". It might contain elements that are not comparable ("nans" in the matrix), and it might not be symmetric, but some similarity measure is crucial; there is no "magical", "deep" meaning "in the data". You may extract a similarity measure from some different models (like generative models, autoencoders etc.), but it is still the same conceptually: instead of giving the rules by hand, you give by hand an algorithm which extracts the rules.
To sum up: you cannot evaluate a clustering as such. You can only evaluate how well it works for a particular task, and this task can be:
some bigger problem, where clustering is just one of the steps and you plug your clustering and observe change in the whole system's quality
optimization of some class-based criterion (supervised)
optimization of some similarity/distance based criterion (unsupervised)
There are no more options. Unsupervised learning is not a real, well-posed problem; it is only a tool to simplify some real problems. As a result you won't ever be able to say "this clustering is good"; you can only say that "this clustering is good in tasks A, B, C under the assumption of pipelines X, Y, Z".
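As a small illustration of the point that an internal index reflects the chosen proximity rather than "the data itself", the sketch below computes a plain silhouette score for one fixed partition under two different distance functions. The toy points and the rescaled distance are assumptions chosen to make the effect obvious.

```python
import math

def silhouette(points, labels, dist):
    """Mean silhouette coefficient of a partition under a given distance function."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def euclidean(p, q):
    return math.dist(p, q)

def x_heavy(p, q):
    """Euclidean distance after stretching the first coordinate by 100 (arbitrary)."""
    return math.hypot(100 * (p[0] - q[0]), p[1] - q[1])

points = [(0, 0), (0, 10), (1, 0), (1, 10)]
labels = [0, 0, 1, 1]  # the same partition in both evaluations

score_euclidean = silhouette(points, labels, euclidean)
score_x_heavy = silhouette(points, labels, x_heavy)
```

The partition is identical in both calls, yet the score is negative under the plain Euclidean distance and strongly positive under the rescaled one, so the verdict depends entirely on the proximity you supply.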
I've read a couple of introductory sections of books as well as a few papers on both topics, and it looks to me that these two methods are pretty much exactly the same. That said, I haven't had the time to actually deeply research the topics yet, so I might be wrong.
What are the distinctions between genetic algorithms and evolution strategies? What makes them different, and where are they similar?
In evolution strategies, the individuals are coded as vectors of real numbers. On reproduction, parents are selected randomly and the fittest offspring are selected and inserted into the next generation. ES individuals are self-adapting: the step size or "mutation strength" is encoded in the individual, so good parameters get to the next generation by selecting good individuals.
In genetic algorithms, the individuals are coded as integers. Selection is done by choosing parents with probability proportional to their fitness, so individuals must be evaluated before the first selection is done. Genetic operators work on the bit level (e.g. cutting a bit string into multiple pieces and interchanging them with the pieces of the other parent, or flipping single bits).
That's the theory. In practice, it is sometimes hard to distinguish between the two kinds of evolutionary algorithm, and you may need to create hybrid algorithms (e.g. integer (bit-string) individuals that encode the parameters of the genetic operators).
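The classic GA ingredients described above (fitness-proportional parent selection, crossover that swaps pieces of two bit strings, and single-bit flips) can be sketched roughly as follows. The OneMax fitness function (count of 1 bits) and all parameter values are assumptions picked only to keep the example small.

```python
import random

def fitness(bits):
    """OneMax toy problem: the fitness is the number of 1 bits."""
    return sum(bits)

def roulette_select(population, fitnesses, rng):
    """Pick one parent with probability proportional to its fitness."""
    weights = [f + 1e-9 for f in fitnesses]  # tiny offset avoids an all-zero total
    return rng.choices(population, weights=weights, k=1)[0]

def one_point_crossover(p1, p2, rng):
    """Cut both parents at the same random point and swap the tails."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def bit_flip(bits, rate, rng):
    """Flip each bit independently with probability rate."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]

def run_ga(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Everyone is evaluated before selection, as fitness-proportional
        # selection requires.
        fitnesses = [fitness(ind) for ind in population]
        next_gen = []
        while len(next_gen) < pop_size:
            a = roulette_select(population, fitnesses, rng)
            b = roulette_select(population, fitnesses, rng)
            c1, c2 = one_point_crossover(a, b, rng)
            next_gen.append(bit_flip(c1, mutation_rate, rng))
            next_gen.append(bit_flip(c2, mutation_rate, rng))
        population = next_gen[:pop_size]
        best = max(population + [best], key=fitness)
    return best

best = run_ga()
```

Note that the mutation rate here is fixed from outside and never evolves, which is the contrast with the self-adapting step size of an ES.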
Just stumbled on this thread when researching Evolution Strategies (ES).
As Paul noticed before, the encoding is not really the difference here, as this is an implementation detail of specific algorithms, although it seems more common in ES.
To answer the question, we first need to take a small step back and look at the internals of an ES algorithm.
In ES there is a concept of endogenous and exogenous parameters of the evolution. Endogenous parameters are associated with individuals and are therefore evolved together with them; exogenous parameters are provided from "outside" (e.g. set constant by the developer, or set by a function/policy depending on the iteration number).
The individual k consists therefore of two parts:
y(k) - a set of object parameters (e.g. a vector of real/int values) which denote the individual genotype
s(k) - a set of strategy parameters (e.g. a vector of real/int values again), which can, for example, control the statistical properties of mutation
Those two vectors are being selected, mutated, recombined together.
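A minimal sketch of that idea: each individual carries its object parameters y together with one strategy parameter (a mutation step size, called sigma here), and both are mutated and selected together. The log-normal update of sigma, the learning rate tau = 1/sqrt(2n), the comma selection scheme and the sphere test function are all assumptions for this illustration, and recombination is left out for brevity.

```python
import math
import random

def sphere(y):
    """Toy objective to minimise: sum of squares."""
    return sum(v * v for v in y)

def mutate(individual, tau, rng):
    """Mutate the strategy parameter first, then use it to mutate the object parameters."""
    y, sigma = individual
    new_sigma = sigma * math.exp(tau * rng.gauss(0.0, 1.0))    # endogenous step size
    new_y = [v + new_sigma * rng.gauss(0.0, 1.0) for v in y]   # object parameters
    return new_y, new_sigma

def evolve(dim=5, mu=5, lam=35, generations=60, seed=0):
    """A (mu, lambda)-style loop: lam offspring, the best mu become the next parents."""
    rng = random.Random(seed)
    tau = 1.0 / math.sqrt(2.0 * dim)  # exogenous parameter, fixed from outside
    parents = [([5.0] * dim, 1.0) for _ in range(mu)]
    best = parents[0]
    for _ in range(generations):
        offspring = [mutate(rng.choice(parents), tau, rng) for _ in range(lam)]
        offspring.sort(key=lambda ind: sphere(ind[0]))
        parents = offspring[:mu]  # sigma survives only if its individual does
        if sphere(parents[0][0]) < sphere(best[0]):
            best = parents[0]
    return best

best_y, best_sigma = evolve()
```

Here tau, mu and lam are exogenous, while sigma is endogenous: nobody sets it by hand after initialisation, and good step sizes spread because the individuals carrying them get selected.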
The main difference between GA and ES is that in classic GA there is no distinction between types of algorithm parameters. In fact all the parameters are set from "outside", so in ES terms are exogenous.
There are also other minor differences, e.g. in ES the selection policy is usually one and the same and in GA there are multiple different approaches which can be interchanged.
You can find a more detailed explanation here (see Chapter 3): Evolution strategies. A comprehensive introduction
In most newer textbooks on GA, real-valued coding is introduced as an alternative to the integer one, i.e. individuals can be coded as vectors of real numbers. This is called continuous parameter GA (see e.g. Haupt & Haupt, "Practical Genetic Algorithms", J.Wiley&Sons, 1998). So this is practically identical to ES real number coding.
With respect to parent selection, there are many different strategies published for GAs. I don't know them all, but I assume selection among all individuals (not only the best) has been used for some applications.
The main difference seems to be that a genetic algorithm represents a solution using a sequence of integers, whereas an evolution strategy uses a sequence of real numbers -- reference: http://en.wikipedia.org/wiki/Evolutionary_algorithm#
As the Wikipedia source (http://en.wikipedia.org/wiki/Genetic_algorithm) and @Vaughn Cato said, the difference between the two techniques lies in the implementation. EA use real numbers and GA use integers.
However, in practice I think you could use integers or real numbers in the formulation of your problem and in your program. It depends on you. For instance, for protein folding you can say the set of dihedral angles forms a vector. This is a vector of real numbers, but the entries are labeled by integers, so I think you can formulate your problem and write your program based on integer arithmetic. It is just an idea.
I have a question that is somewhat high level, so I'll try to be as specific as possible.
I'm doing a lot of research that involves combining disparate data sets with header information that refers to the same entity, usually a company or a financial security. This record linking usually involves header information in which the name is the only common primary identifier, but where some secondary information is often available (such as city and state, dates of operation, relative size, etc). These matches are usually one-to-many, but may be one-to-one or even many-to-many. I have usually done this matching by hand or with very basic text comparison of cleaned substrings. I have occasionally used a simple matching algorithm like a Levenshtein distance measure, but I never got much out of it, in part because I didn't have a good formal way of applying it.
My guess is that this is a fairly common question and that there must be some formalized processes that have been developed to do this type of thing. I've read a few academic papers on the subject that deal with theoretical appropriateness of given approaches, but I haven't found any good source that walks through a recipe or at least a practical framework.
My question is the following:
Does anyone know of a good source for implementing multi-dimensional fuzzy record matching, like a book or a website or a published article or working paper?
I'd prefer something that had practical examples and a well defined approach.
The approach could be iterative, with human checks for improvement at intermediate stages.
(edit) The linked data is used for statistical analysis. As such, a little bit of noise is OK, but there is a strong preference for fewer "incorrect matches" over fewer "incorrect non-matches".
If they were in Python that would be fantastic, but not necessary.
One last thing, if it matters, is that I don't care much about computational efficiency. I'm not implementing this dynamically and I'm usually dealing with a few thousand records.
One common method that shouldn't be terribly expensive for "a few thousand records" would be cosine similarity. Although most often used for comparing text documents, you can easily modify it to work with any kind of data.
The linked Wikipedia article is pretty sparse on details, but following links and doing a few searches will get you some good info, and potentially an implementation that you can modify to fit your purposes. In fact, take a look at Simple implementation of N-Gram, tf-idf and Cosine similarity in Python.
A simpler calculation, and one that might be "good enough" for your purposes would be a Jaccard index. The primary difference is that typically cosine similarity takes into account the number of times a word is used in a document and in the entire set of documents, whereas the Jaccard index only cares that a particular word is in the document. There are other differences, but that one strikes me as the most important.
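To show the difference in miniature, here is a sketch of both measures over whitespace-separated tokens. The cosine version uses raw term counts only (no tf-idf weighting across the whole record set), which is a simplification; the tokenizer and the example strings are assumptions.

```python
import math
from collections import Counter

def tokens(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def jaccard(a, b):
    """Shared distinct tokens divided by all distinct tokens."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Cosine of the angle between the two term-count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

# Repeated words change the cosine score but not the Jaccard index.
j = jaccard("acme acme holdings", "acme holdings")
c = cosine("acme acme holdings", "acme holdings")
```

In this example the Jaccard index is exactly 1.0 because both strings contain the same set of words, while the cosine similarity is below 1.0 because the word counts differ.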
The problem is that you have an array of distances, at least one for each column, and you want to combine those distances in an optimal way to indicate whether a pair of records are the same thing or not.
This is a classification problem. There are many ways to approach it, but logistic regression is one of the simpler methods. To train a classifier, you will need to label some pairs of records as either matches or not.
The dedupe python library helps you do this and other parts of the difficult task of record linkage. The documentation has a pretty good overview of how to approach the problem of record linkage comprehensively.
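The general idea of combining per-field distances with a classifier can be sketched without any external library. The code below is a hand-rolled illustration and not the dedupe API: it builds two features per record pair (name similarity from the standard library's difflib, and a same-city flag), fits a tiny logistic regression by gradient descent on a few labeled pairs, and applies a strict threshold to favour fewer incorrect matches. The records, the feature choice and the 0.9 threshold are all made-up assumptions.

```python
import math
from difflib import SequenceMatcher

def pair_features(rec_a, rec_b):
    """Turn a pair of (name, city) records into [name similarity, same-city flag]."""
    name_sim = SequenceMatcher(None, rec_a[0].lower(), rec_b[0].lower()).ratio()
    same_city = 1.0 if rec_a[1].lower() == rec_b[1].lower() else 0.0
    return [name_sim, same_city]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def match_probability(features, weights, bias):
    return sigmoid(bias + sum(w * v for w, v in zip(weights, features)))

def train_logistic(X, y, lr=0.5, epochs=5000):
    """Plain batch gradient descent on the logistic loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            err = match_probability(xi, weights, bias) - yi
            for k, v in enumerate(xi):
                grad_w[k] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical hand-labeled pairs: (record A, record B, 1 = same entity).
labeled_pairs = [
    (("Acme Corp", "Boston"), ("ACME Corporation", "Boston"), 1),
    (("Globex Inc", "Austin"), ("Globex Incorporated", "Austin"), 1),
    (("Initech LLC", "Dallas"), ("Initech, LLC", "Dallas"), 1),
    (("Acme Corp", "Boston"), ("Zenith Holdings", "Denver"), 0),
    (("Globex Inc", "Austin"), ("Initech LLC", "Austin"), 0),
    (("Initech LLC", "Dallas"), ("Umbrella Co", "Miami"), 0),
]
X = [pair_features(a, b) for a, b, _ in labeled_pairs]
y = [label for _, _, label in labeled_pairs]
weights, bias = train_logistic(X, y)

THRESHOLD = 0.9  # strict on purpose: prefer missed matches over wrong matches
candidate = pair_features(("Acme Corp", "Boston"), ("Acme Corp.", "Boston"))
is_match = match_probability(candidate, weights, bias) >= THRESHOLD
```

Pairs scoring between, say, 0.5 and the threshold are natural candidates for the human check mentioned in the question; each reviewed pair can be added to the labeled set before retraining.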
I'm reading about fuzzy logic and I just don't see how it would possibly improve machine learning algorithms in most instances (which it seems to be applied to relatively often).
Take, for example, k nearest neighbors. If you have a bunch of attributes like color: [red, blue, green, orange], temperature: [real number], shape: [round, square, triangle], you can't really fuzzify any of these except for the real-numbered attribute (please correct me if I'm wrong), and I don't see how this can improve anything more than bucketing things together.
How can fuzzy logic be used to improve machine learning? The toy examples you'll find on most websites don't seem to be all that applicable, most of the time.
Fuzzy logic is advisable when the variables have a natural shape interpretation. For example, [very few, few, many, very many] have a nice overlapping trapezoid interpretation of values.
Variables like color might not. Fuzzy variables denote degree of membership, that's when they become useful.
Regarding machine learning, it depends on what stage of the algorithm you want to apply fuzzy logic at. In my opinion it would be better applied after the clusters are found (using traditional learning techniques), to determine the degree of membership of a certain point in the search space in each cluster. That doesn't improve learning per se, though, only classification after learning.
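One way to compute such degrees of membership once cluster centers are already known is the membership formula used in fuzzy c-means, sketched below. Using that particular formula, the Euclidean distance and the fuzzifier value m = 2 are assumptions for this example; the answer above does not prescribe a specific method.

```python
import math

def memberships(point, centers, m=2.0):
    """Degree of membership of point in each cluster (fuzzy c-means formula).

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the distance
    from the point to center i. The memberships sum to 1.
    """
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point sits exactly on one or more centers: split membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** power for dk in dists) for di in dists]

# Centers as they might come out of an ordinary k-means run (made-up values).
centers = [(0.0, 0.0), (4.0, 0.0)]
halfway = memberships((2.0, 0.0), centers)
near_first = memberships((1.0, 0.0), centers)
```

A point halfway between the two centers gets equal membership in both, while a point closer to the first center gets a correspondingly higher degree there instead of a hard yes/no assignment.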
[round, square, triangle] are mostly ideal categories, which exist primarily in geometry (i.e. in theory). In the real world, some shapes might be almost square or more or less round (circular). There are many nuances of red, and some colors are closer to some others (ask a woman to explain turquoise, for example). Hence, although abstract categories and some specific values are useful as references, in the real world the objects or values are not necessarily equal to them.
Fuzzy membership allows you to measure how far some specific object is from some ideal. Using this measure lets one avoid a flat "no, it's not circular" (which might lead to information loss) and instead make use of the degree to which the given object is (or is not) circular.
In my view, fuzzy logic is not a practically viable approach to anything unless you are building a purpose-built fuzzified controller or some rule-based structure, such as one for compliance or policies. Fuzzy logic implies dealing with everything between and including 0 and 1. However, I find it a bit flawed when you approach more complicated problems where you need to apply fuzzy logic in three-dimensional spaces. You can still approach multivariate problems without having to look at fuzzy logic. Having studied fuzzy logic, I found myself disagreeing with the principles behind fuzzy sets in high-dimensional spaces: the approach seems infeasible, impractical, and not very logically sound. The natural language base that you would be applying in your fuzzy set solution will also be very ad hoc: what exactly [very, few, many] means is whatever you define in your application.
In a lot of machine learning, you will find that you don't even have to go so far as to build natural language underpinnings into your model. In fact, you will find you can achieve even better results without having to apply fuzzy logic to any aspect of your model.
Just to irritate you a bit by forcibly adding fuzziness to this: if instead of the "shape" attribute you had a "number of sides" attribute, further divided into "less", "medium", "many" and "uncountable", the square could have been a part of both "less" and "medium", given the appropriate membership function. In place of the "color" attribute, if you had a "red" attribute, then a membership function could have been built using the RGB code. So, as my experience in data mining says, every method can be applied to every dataset; what works, works.
Couldn't one just convert discrete sets into continuous ones and get the same effects as fuzziness, while being able to use all the techniques of probability theory?
For instance size ['small', 'medium', 'big'] ==> [0,1]
It's not clear to me what you're trying to accomplish in the example you give (shapes, colors, etc.). Fuzzy logic has been used successfully with machine learning, but personally I think it is probably more often useful in constructing policies. Rather than go on about it, I refer you to an article I published in the Mar/Apr-2002 issue of "PC AI" magazine, which hopefully makes the idea clear:
Putting Fuzzy Logic to Work: An Introduction to Fuzzy Rules
input: phrase 1, phrase 2
output: semantic similarity value (between 0 and 1), or the probability these two phrases are talking about the same thing
You might want to check out this paper:
Sentence similarity based on semantic nets and corpus statistics (PDF)
I've implemented the algorithm described. Our context was very general (effectively any two English sentences) and we found the approach taken was too slow and the results, while promising, not good enough (or likely to be so without considerable, extra, effort).
You don't give a lot of context so I can't necessarily recommend this but reading the paper could be useful for you in understanding how to tackle the problem.
Regards,
Matt.
There's a short and a long answer to this.
The short answer:
Use the WordNet::Similarity Perl package. If Perl is not your language of choice, check the WordNet project page at Princeton, or google for a wrapper library.
The long answer:
Determining word similarity is a complicated issue, and research is still very hot in this area. To compute similarity, you need an appropriate representation of the meaning of a word. But what would be a representation of the meaning of, say, 'chair'? In fact, what is the exact meaning of 'chair'? If you think long and hard about this, it will twist your mind, you will go slightly mad, and finally take up a research career in Philosophy or Computational Linguistics to find the truth™. Both philosophers and linguists have tried to come up with an answer for literally thousands of years, and there's no end in sight.
So, if you're interested in exploring this problem a little more in-depth, I highly recommend reading Chapter 20.7 in Speech and Language Processing by Jurafsky and Martin, some of which is available through Google Books. It gives a very good overview of the state-of-the-art of distributional methods, which use word co-occurrence statistics to define a measure for word similarity. You are not likely to find libraries implementing these, however.
For anyone just coming to this, I would suggest taking a look at SEMILAR - http://www.semanticsimilarity.org/ . They implement a lot of the modern research methods for calculating word and sentence similarity. It is written in Java.
SEMILAR API comes with various similarity methods based on Wordnet, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), BLEU, Meteor, Pointwise Mutual Information (PMI), Dependency based methods, optimized methods based on Quadratic Assignment, etc. And the similarity methods work in different granularities - word to word, sentence to sentence, or bigger texts.
You might want to check into the WordNet project at Princeton University. One possible approach to this would be to first run each phrase through a stop-word list (to remove "common" words such as "a", "to", "the", etc.) Then for each of the remaining words in each phrase, you could compute the semantic "similarity" between each of the words in the other phrase using a distance measure based on WordNet. The distance measure could be something like: the number of arcs you have to pass through in WordNet to get from word1 to word2.
Sorry this is pretty high-level. I've obviously never tried this. Just a quick thought.
I would look into latent semantic indexing for this. I believe you can create something similar to a vector space search index but with semantically related terms being closer together i.e. having a smaller angle between them. If I learn more I will post here.
Sorry to dig up a 6 year old question, but as I just came across this post today, I'll throw in an answer in case anyone else is looking for something similar.
cortical.io has developed a process for calculating the semantic similarity of two expressions and they have a demo of it up on their website. They offer a free API providing access to the functionality, so you can use it in your own application without having to implement the algorithm yourself.
One simple solution is to use the dot product of character n-gram vectors. This is robust to ordering changes (which many edit distance metrics are not) and captures many issues around stemming. It also avoids the AI-complete problem of full semantic understanding.
To compute the n-gram vector, just pick a value of n (say, 3), and hash every 3-character sequence in the phrase into a vector. Normalize the vector to unit length, then take the dot product of different vectors to detect similarity.
This approach has been described in
J. Mitchell and M. Lapata, “Composition in Distributional Models of Semantics,” Cognitive Science, vol. 34, no. 8, pp. 1388–1429, Nov. 2010., DOI 10.1111/j.1551-6709.2010.01106.x
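A minimal sketch of the hashed character n-gram idea, following the "character n-gram" description above: the vector size of 1024, the use of zlib.crc32 as a stable hash and the space padding are assumptions for this example, not details taken from the cited paper.

```python
import math
import zlib

def ngram_vector(text, n=3, dim=1024):
    """Hash every character n-gram of text into a fixed-size, unit-length vector."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "  # padding lets word boundaries form n-grams too
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # crc32 is used instead of hash() so buckets are stable between runs
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def similarity(a, b, n=3, dim=1024):
    """Dot product of the two unit vectors, i.e. their cosine similarity."""
    va, vb = ngram_vector(a, n, dim), ngram_vector(b, n, dim)
    return sum(x * y for x, y in zip(va, vb))

reordered = similarity("the quick brown fox", "brown fox the quick")
unrelated = similarity("the quick brown fox", "zebra umbrella")
```

Reordering the words keeps most character trigrams intact, so the reordered pair scores much higher than the unrelated pair. Two distinct n-grams can collide in the same bucket, which adds a little noise; a larger dim reduces that.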
I would have a look at statistical techniques that take into consideration the probability of each word appearing within a sentence. This will allow you to give less importance to popular words such as 'and', 'or', 'the' and more importance to words that appear less regularly, and that are therefore a better discriminating factor. For example, if you have two sentences:
1) The smith-waterman algorithm gives you a similarity measure between two strings.
2) We have reviewed the smith-waterman algorithm and we found it to be good enough for our project.
The fact that the two sentences share the words "smith-waterman" and "algorithm" (which are not as common as 'and', 'or', etc.) will allow you to say that the two sentences might indeed be talking about the same topic.
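One common way to put this weighting into practice is inverse document frequency (IDF): words that occur in many sentences get a low weight, rare words get a high one. The sketch below is an assumption about how one might do it, with a made-up four-sentence corpus and a smoothed IDF formula chosen for this example.

```python
import math
from collections import Counter

def tokens(text):
    """Lowercase, split on whitespace and strip trailing punctuation."""
    return [w.strip(".,;:!?") for w in text.lower().split()]

def build_idf(corpus):
    """Smoothed IDF: log((1 + N) / (1 + df)) + 1 for each word seen in the corpus."""
    n_docs = len(corpus)
    df = Counter(word for doc in corpus for word in set(tokens(doc)))
    return {w: math.log((1 + n_docs) / (1 + d)) + 1.0 for w, d in df.items()}

def weighted_similarity(a, b, idf):
    """Cosine similarity between IDF-weighted term-count vectors."""
    va = {w: c * idf.get(w, 1.0) for w, c in Counter(tokens(a)).items()}
    vb = {w: c * idf.get(w, 1.0) for w, c in Counter(tokens(b)).items()}
    dot = sum(va[w] * vb[w] for w in va if w in vb)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

corpus = [
    "The smith-waterman algorithm gives you a similarity measure between two strings.",
    "We have reviewed the smith-waterman algorithm and we found it to be good enough for our project.",
    "The weather is nice and the sky is clear.",
    "We went to the market and bought some fruit.",
]
idf = build_idf(corpus)
same_topic = weighted_similarity(corpus[0], corpus[1], idf)
other_topic = weighted_similarity(corpus[0], corpus[2], idf)
```

With these weights, "the" counts for less than "smith-waterman", so the first two sentences come out as more similar to each other than the first and third, even though all three contain "the".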
Summarizing, I would suggest you have a look at:
1) String similarity measures;
2) Statistical methods.
Hope this helps.
Try SimService, which provides a service for computing top-n similar words and phrase similarity.
This requires that your algorithm actually knows what you're talking about. It can be done in some rudimentary form by just comparing words and looking for synonyms etc., but any sort of accurate result would require some form of intelligence.
Take a look at http://mkusner.github.io/publications/WMD.pdf. This paper describes an algorithm called Word Mover's Distance that tries to uncover semantic similarity. It relies on the similarity scores as dictated by word2vec. Integrating this with GoogleNews-vectors-negative300 yields desirable results.