I am quite new to R and am trying to run a correspondence analysis (corresp from the MASS package) on summarized data. While the output shows row and column scores, the resulting biplot shows the column scores as zero, making the plot unreadable (all values are arranged by row score as expected, but lie flat along the column-score axis).
The code is:
corresp(some_data)
biplot(corresp(some_data, nf = 2))
I would be grateful for any suggestions as to what I'm doing wrong and how to amend this, thanks in advance!
Martin
As suggested here:
http://www.statsoft.com/textbook/correspondence-analysis
the biplot actually depicts distributions of the row/column variables over 2 extracted dimensions where the variables' dependency is "the sharpest".
It looks like in your case most of the dependency is concentrated along just one dimension, while the second dimension is already much less significant.
It does not seem, however, that your relationships are weak. On the contrary, looking at your graph, one can see the red (column) variable intersecting two distinct regions of the other variable's values.
Makes sense?
Regards,
Igor
Current implementation of C4.5 in VFDT (http://www.cs.washington.edu/dm/vfml/vfdt.html) or for that matter any other implementation uses the C4.5 format of files for providing inputs for constructing the decision tree. According to this the attributes can have the following formats:
continuous
If the attribute has a continuous value.
discrete
The word 'discrete' followed by an integer which indicates how many values the attribute can take.
list of identifiers
This is a discrete attribute with the values enumerated (this is the preferred method for discrete attributes). The identifiers should be separated by commas.
ignore
means the attribute should be ignored - it won't be used.
Does anybody know how we can specify discrete valued attributes whose complete set of possible values is too large to list down?
For example, an "IP-Address" attribute can have Math.Pow(256,4) possible discrete values;
a "QueryString" attribute can have an infinite number of possible values, etc.
Can the C4.5 algorithm handle the case where the attribute has, say, 100,000 distinct discrete values, or where the exact bound is not known and only an approximation is available?
Thanks.
The usual choice is to enumerate all the values of a discrete feature that occur in your training set. Since the algorithm can never gather enough statistics for values that are not seen during training, those would be ignored no matter how you'd implement them.
Mind you, it's quite hard to gather statistics for such features anyway, so you might want to think about different representations. In particular, multi-word strings of text can be tokenized and treated as bags of words.
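As a rough illustration of both suggestions, here is a minimal Python sketch. The row layout, the attribute names, and the helper functions are all made up for the example. The output line only imitates the "list of identifiers" style, so the exact file syntax (including any escaping of characters such as periods) should be checked against the C4.5 documentation.

```python
import re
from collections import Counter

def observed_values(rows, column):
    """Return the sorted distinct values of one column seen in the training rows."""
    return sorted({row[column] for row in rows})

def names_line(attribute, values):
    """Format an attribute as 'name: v1, v2, ...' (illustrative only, no escaping)."""
    return f"{attribute}: {', '.join(values)}"

def bag_of_words(text):
    """Split a string into lowercase alphanumeric tokens and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical training data; only values that occur here get enumerated.
training_rows = [
    {"ip": "10.0.0.1", "query": "q=decision+tree&page=2"},
    {"ip": "10.0.0.2", "query": "q=decision+stump"},
    {"ip": "10.0.0.1", "query": "q=tree"},
]

ip_values = observed_values(training_rows, "ip")
ip_line = names_line("IP-Address", ip_values)
tokens = bag_of_words(training_rows[0]["query"])
```

Each token from the bag of words could then become its own attribute (present/absent or a count), which usually gives the learner more to work with than one attribute with a huge value set.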
I'm writing a piece of code to evaluate my clustering algorithm, and I find that every kind of evaluation method needs the basic data from an m*n matrix like A = {aij}, where aij is the number of data points that are members of class ci and elements of cluster kj.
But there appear to be two matrices of this type in Introduction to Data Mining (Pang-Ning Tan et al.): one is the Confusion Matrix, the other is the Contingency Table. I do not fully understand the difference between the two. Which best describes the matrix I want to use?
Wikipedia's definition:
In the field of artificial intelligence, a confusion matrix is a
visualization tool typically used in supervised learning (in
unsupervised learning it is typically called a matching matrix). Each
column of the matrix represents the instances in a predicted class,
while each row represents the instances in an actual class.
A confusion matrix should be clear: it tells how many actual results match the predicted results. For example, see this confusion matrix:
                    Predicted class
                    c1      c2
Actual class  c1    15      3
              c2    0       2
It tells that:
Column 1, row 1 means that the classifier has predicted 15 items as belonging to class c1, and those 15 items actually belong to class c1 (correct predictions).
Column 2, row 1 means that the classifier has predicted that 3 items belong to class c2, but they actually belong to class c1 (wrong predictions).
Column 1, row 2 means that none of the items that actually belong to class c2 were predicted to belong to class c1 (this cell counts wrong predictions, and here there are none).
Column 2, row 2 means that 2 items that belong to class c2 were predicted to belong to class c2 (correct predictions).
Now see the formulas for Accuracy and Error Rate in your book (Chapter 4, Section 4.2), and you should be able to understand clearly what a confusion matrix is. It is used to test the accuracy of a classifier using data with known results. The K-Fold method (also mentioned in the book) is one of the methods for calculating the accuracy of a classifier.
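To make the accuracy and error-rate calculation concrete, here is a small Python sketch using the 2x2 example above. It uses the usual definitions (correct predictions on the diagonal divided by all predictions); the function name is just for illustration.

```python
# Rows are actual classes (c1, c2); columns are predicted classes (c1, c2).
confusion = [
    [15, 3],
    [0, 2],
]

def accuracy(matrix):
    """Fraction of items on the diagonal, i.e. predicted class equals actual class."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

acc = accuracy(confusion)   # (15 + 2) / 20
error_rate = 1 - acc        # (3 + 0) / 20
```

For this matrix, 17 of the 20 items are classified correctly and 3 are not.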
Now, for Contingency table:
Wikipedia's definition:
In statistics, a contingency table (also referred to as cross
tabulation or cross tab) is a type of table in a matrix format that
displays the (multivariate) frequency distribution of the variables.
It is often used to record and analyze the relation between two or
more categorical variables.
In data mining, contingency tables are used to show what items appeared in a reading together, like in a transaction or in the shopping-cart of a sales analysis. For example (this is the example from the book you have mentioned):
         Coffee   !Coffee   Total
Tea      150      50        200
!Tea     650      150       800
Total    800      200       1000
It tells that out of 1000 survey responses (people were asked whether they like coffee, tea, both, or neither):
150 people like both tea and coffee
50 people like tea but do not like coffee
650 people do not like tea but like coffee
150 people like neither tea nor coffee
Contingency tables are used to find the Support and Confidence of association rules, basically to evaluate association rules (read Chapter 6, 6.7.1).
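As a quick worked example, here is how support and confidence for the rule tea -> coffee fall out of the table above, using the standard definitions (support = fraction of all responses containing both items, confidence = fraction of tea drinkers who also like coffee). The variable names are only for illustration.

```python
# Counts taken from the tea/coffee contingency table.
tea_and_coffee = 150
tea_total = 200
coffee_total = 800
n = 1000

support = tea_and_coffee / n               # P(tea and coffee)
confidence = tea_and_coffee / tea_total    # P(coffee | tea)
coffee_share = coffee_total / n            # P(coffee), for comparison
```

Note that the confidence (0.75) is lower than the overall share of coffee drinkers (0.8), which is exactly the kind of thing the contingency table lets you spot.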
Now the difference is that Confusion Matrix is used to evaluate the performance of a classifier, and it tells how accurate a classifier is in making predictions about classification, and contingency table is used to evaluate association rules.
Now after reading the answer, google a bit (always use google while you are reading your book), read what is in the book, see a few examples, and don't forget to solve a few exercises given in the book, and you should have a clear concept about both of them, and also what to use in a certain situation and why.
Hope this helps.
In short, a contingency table is used to describe data, and a confusion matrix is, as others have pointed out, often used when comparing two hypotheses. One can think of predicted vs. actual classification/categorization as two hypotheses, with the ground truth being the null and the model output being the alternative.
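For the m*n matrix described in the question, here is a minimal Python sketch that counts how many points fall into each (class, cluster) pair. It assumes you have two parallel lists of labels, one with the known class of each point and one with the cluster assigned to it; the function and label names are invented for the example.

```python
from collections import Counter

def class_cluster_counts(classes, clusters):
    """Build the matrix a[i][j] = number of points in class i and cluster j."""
    pairs = Counter(zip(classes, clusters))
    class_ids = sorted(set(classes))
    cluster_ids = sorted(set(clusters))
    matrix = [[pairs[(c, k)] for k in cluster_ids] for c in class_ids]
    return class_ids, cluster_ids, matrix

# Hypothetical labels for five data points.
true_classes = ["c1", "c1", "c2", "c2", "c2"]
assigned_clusters = ["k1", "k2", "k2", "k2", "k1"]

class_ids, cluster_ids, a = class_cluster_counts(true_classes, assigned_clusters)
```

Since cluster labels carry no predefined correspondence to class labels, the matrix need not be square and its diagonal has no special meaning.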
Let's say I have two fairly large data sets: the first is called "Base" and contains 200 million tab-delimited rows, and the second is called "MatchSet" and has 10 million tab-delimited rows of similar data.
Let's say I then also have an arbitrary function called Match(row1, row2) and Match() essentially contains some heuristics for looking at row1 (from MatchSet) and comparing it to row2 (from Base) and determining if they are similar in some way.
Let's say the rules implemented in Match() are custom and complex rules, aka not a simple string match, involving some proprietary methods. Let's say for now Match(row1,row2) is written in pseudo-code, so implementation in another language is not a problem (though it's in C++ today).
In a linear model, aka program running on one giant processor - we would read each line from MatchSet and each line from Base and compare one to the other using Match() and write out our match stats. For example we might capture: X records from MatchSet are strong matches, Y records from MatchSet are weak matches, Z records from MatchSet do not match. We would also write the strong/weak/non values to separate files for inspection. Aka, a nested loop of sorts:
for each row1 in MatchSet
{
    for each row2 in Base
    {
        var type = Match(row1,row2);
        switch(type)
        {
            //do something based on type
        }
    }
}
I've started considering Hadoop streaming as a method for running these comparisons as a batch job in a short amount of time. However, I'm having a bit of a hard time getting my head around the map-reduce paradigm for this type of problem.
I understand pretty clearly at this point how to take a single input from hadoop, crunch the data using a mapping function and then emit the results to reduce. However, the "nested-loop" approach of comparing two sets of records is messing with me a bit.
The closest I'm coming to a solution is that I would basically still have to do a 10 million record compare in parallel across the 200 million records, so 200 million/n nodes * 10 million iterations per node. Is that the most efficient way to do this?
From your description, it seems to me that your problem can be arbitrarily complex and could be a victim of the curse of dimensionality.
Imagine for example that your rows represent n-dimensional vectors, and that your matching function is "strong", "weak" or "no match" based on the Euclidean distance between a Base vector and a MatchSet vector. There are great techniques to solve these problems with a trade-off between speed, memory and the quality of the approximate answers. Critically, these techniques typically come with known bounds on time and space, and the probability to find a point within some distance around a given MatchSet prototype, all depending on some parameters of the algorithm.
Rather than for me to ramble about it here, please consider reading the following:
Locality Sensitive Hashing
The first few hits on Google Scholar when you search for "locality sensitive hashing map reduce". In particular, I remember reading [Das, Abhinandan S., et al. "Google news personalization: scalable online collaborative filtering." Proceedings of the 16th international conference on World Wide Web. ACM, 2007] with interest.
Now, on the other hand, if you can devise a scheme that is directly amenable to some form of hashing, then you can easily produce a key for each record with such a hash (or even a small number of possible hash keys, one of which would match the query "Base" data), and the problem becomes a simple large(-ish) scale join. (I say "largish" because joining 200M rows with 10M rows is quite small if the problem is indeed a join.) As an example, consider the way CDDB computes the 32-bit ID for any music CD (the CDDB1 calculation). Sometimes a given title may yield slightly different IDs (i.e. different CDs of the same title, or even the same CD read several times). But by and large there is a small set of distinct IDs for that title. At the cost of a small replication of the MatchSet, in that case you can get very fast search results.
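To show what the hash-key idea buys you, here is a small in-memory Python sketch. The key function is purely hypothetical (it just lowercases the first tab-delimited field); in practice it would be whatever cheap signature your Match() rules allow. Only rows that share a key are ever handed to the expensive comparison.

```python
from collections import defaultdict

def blocking_key(row):
    """Hypothetical key: the first tab-delimited field, lowercased."""
    return row.split("\t")[0].lower()

def candidate_pairs(match_set, base, key=blocking_key):
    """Yield (match_row, base_row) only for rows whose keys collide."""
    buckets = defaultdict(list)
    for row in match_set:
        buckets[key(row)].append(row)
    for base_row in base:
        for match_row in buckets.get(key(base_row), []):
            yield match_row, base_row

match_set = ["Alice\t1", "bob\t2"]
base = ["alice\t9", "carol\t3", "BOB\t7"]
pairs = list(candidate_pairs(match_set, base))
```

On a cluster the same thing is a reduce-side join: the mapper emits the key for rows of both data sets, and each reducer runs Match() only within its own key group.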
Check Section 3.5 - Relational Joins in the paper 'Data-Intensive Text Processing with MapReduce'. I haven't gone into detail, but it might help you.
This is an old question, but your proposed solution is correct assuming that your single stream job does 200M * 10M Match() computations. By doing N batches of (200M / N) * 10M computations, you've achieved a factor of N speedup. By doing the computations in the map phase and then thresholding and steering the results to Strong/Weak/No Match reducers, you can gather the results for output to separate files.
If additional optimizations could be utilized, they'd likely apply to both the single stream and parallel versions. Examples include blocking, so that you need to do fewer than 200M * 10M computations, or precomputing constant portions of the algorithm for the 10M match set.
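Here is a rough Python sketch of the map side described above. The match() function is only a stand-in for the proprietary Match() (it compares fields position by position), and the assumption is that the 10M-row MatchSet fits in memory on each node while the Base rows stream through.

```python
def match(row1, row2):
    """Hypothetical stand-in for Match(): compares tab-delimited fields."""
    a, b = row1.split("\t"), row2.split("\t")
    shared = sum(1 for x, y in zip(a, b) if x == y)
    if shared == len(a):
        return "strong"
    if shared > 0:
        return "weak"
    return "none"

def map_base_lines(base_lines, match_set):
    """For each streamed Base line, compare against every MatchSet row
    and yield (match_type, row1, row2); match_type acts as the reduce key."""
    for line in base_lines:
        row2 = line.rstrip("\n")
        for row1 in match_set:
            yield match(row1, row2), row1, row2

match_rows = ["a\tb", "x\ty"]
base_split = ["a\tb\n", "a\tz\n"]   # one node's share of Base
emitted = list(map_base_lines(base_split, match_rows))
```

In a streaming job, base_lines would be sys.stdin and each tuple would be printed as a key/value line, so the strong, weak, and non-matching results end up grouped by key for separate output files.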
StackOverflow crowd. I have a very open-ended software design question.
I've been looking for an elegant solution to this for a while, and I was wondering if anyone here had some brilliant insight into the problem. Consider this to be like a data structures puzzle.
What I am trying to do is to create a unit converter that is capable of converting from any unit to any unit. Assume that the lexing and parsing is already done. A few simple examples:
Convert("days","hours") // Yields 24
Convert("revolutions", "degrees") // Yields 360
To make things a little more complicated, it must smoothly handle ambiguities between inputs:
Convert("minutes","hours") // Yields (1/60)
Convert("minutes","revolutions") // Yields (1/21600)
To make things even more fun, it must handle complex units without needing to enumerate all possibilities:
Convert("meters/second","kilometers/hour")
Convert("miles/hour","knots")
Convert("Newton meters","foot pounds")
Convert("Acre feet","meters^3")
There's no right or wrong answer, I'm looking for ideas on how to accomplish this. There's always a brute force solution, but I want something elegant that is simple and scalable.
I would start with a hashtable (or persisted lookup table - your choice how you implement) that carries unit conversions between as many pairs as you care to put in. If you put in every possible pair, then this is your brute force approach.
If you have only partial pairs, you can then do a search across the pairs you do have to find a combination. For example, let's say I have these two entries in my hashtable (from unit, to unit, factor to multiply by):
Feet|Inches|12
Inches|Centimeters|2.54
Now if I want to convert feet to centimeters, I have a simple graph search: vertices are Feet, Inches, and Centimeters, and edges are the 12 and 2.54 conversion factors. The solution in this case is the two edges 12, 2.54 (combined via multiplication, of course). You can get fancier with the graph parameters if you want to.
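A minimal Python sketch of that graph search, assuming each table entry means "multiply the from-unit by the factor to get the to-unit" (so feet to inches is 12) and that every conversion can be reversed by taking the reciprocal. The function names are made up for the example.

```python
from collections import deque

def build_graph(table):
    """Turn (from, to, factor) rows into an adjacency list with reverse edges."""
    graph = {}
    for src, dst, factor in table:
        graph.setdefault(src, []).append((dst, factor))
        graph.setdefault(dst, []).append((src, 1 / factor))
    return graph

def conversion_factor(graph, src, dst):
    """Breadth-first search from src to dst, multiplying factors along the path.
    Returns None when no chain of conversions connects the two units."""
    queue = deque([(src, 1.0)])
    seen = {src}
    while queue:
        unit, factor = queue.popleft()
        if unit == dst:
            return factor
        for nxt, step in graph.get(unit, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, factor * step))
    return None

table = [("feet", "inches", 12.0), ("inches", "centimeters", 2.54)]
graph = build_graph(table)
feet_to_cm = conversion_factor(graph, "feet", "centimeters")
```

Breadth-first search finds the path with the fewest hops, which also keeps the accumulated floating-point error small.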
Another approach might be applying abductive reasoning - look into AI texts about algebraic problem solvers for this...
Edit: Addressing Compound Units
Simplified problem: convert "Acres" to "Meters^2"
In this case, the key is understanding that we are talking about units derived from length, so why don't we insert a new column into the table for unit type, which can be "length" or "area". This will help performance even in the earlier cases, as it gives you an easy column to pare down your search space.
Now the trick is to understand that length^2 = area. Why not add another lookup that stores this metadata:
Area|Length|Length|*
We couple this with the primary units table:
Meters|Feet|3.28|Length
Acres|Feet^2|43560|Area
So the algorithm goes:
Solution is m^2, which is m * m, which is a length * length.
Input is acres, which is an area.
Search the meta table for m, and find the length * length mapping. Note that in more complex examples there may be more than one valid mapping.
Append to the solution a conversion Acres->Feet^2.
Perform the original graph search for Feet->M.
Note that:
The algorithm won't know whether to use area or length as the basic domain in which to work. You can provide it hints, or let it search both spaces.
The meta table gets a little brute-force-ish.
The meta table will need to get smarter if you start mixing types (e.g. Resistance = Voltage / Current) or doing something really ugly and mixing unit systems (e.g. a FooArea = Meters * Feet).
Whatever structure you choose, and your choice may well be directed by your preferred implementation (OO ? functional ? DBMS table ?) I think you need to identify the structure of units themselves.
For example a measurement of 1000km/hr has several components:
a scalar magnitude, 1000;
a prefix, in this case kilo; and
a dimension, in this case L.T^(-1), that is, length divided by time.
Your modelling of measurements with units needs to capture at least this complexity.
As has already been suggested, you should establish what the base set of units you are going to use are, and the SI base units immediately suggest themselves. Your data structure(s) for modelling units would then be defined in terms of those base units. You might therefore define a table (thinking RDBMS here, but easily translatable into your preferred implementation) with entries such as:
unit name       dimension                  conversion to base
foot            Length                     0.3048
gallon(UK)      Length^3                   4.546092 x 10^(-3)
kilowatt-hour   Mass.Length^2.Time^(-2)    3.6 x 10^6
and so forth. You'll also need a table to translate prefixes (kilo-, nano-, mega-, mebi- etc.) into multiplying factors, and a table of base units for each of the dimensions (i.e. meter is the base unit for Length, second for Time, etc.). You'll also have to cope with units such as feet, which are simply synonyms for other units.
The purpose of dimension is, of course, to ensure that your conversions and other operations (such as adding 2 feet to 3.5 metres) are commensurate.
And, for further reading, I suggest this book by Cardarelli.
EDIT in response to comments ...
I'm trying to veer away from suggesting (implementation-specific) solutions, so I'll waffle a bit more. Compound units, such as kilowatt-hours, do pose a problem. One approach would be to tag measurements with multiple unit-expressions, such as kilowatt and hour, and a rule for combining them, in this case multiplication. I could see this getting quite hairy quite quickly. It might be better to restrict the valid set of units to the most common ones in the domain of the application.
As to dealing with measurements in mixed units, well the purpose of defining the Dimension of a unit is to provide some means to ensure that only sensible operations can be applied to measurements-with-units. So, it's sensible to add two lengths (L+L) together, but not a length (L) and a volume (L^3). On the other hand it is sensible to divide a volume by a length (to get an area (L^2)). And it's kind of up to the application to determine if strange units such as kilowatt-hours per square metre are valid.
Finally, the book I link to does enumerate all the possibilities, but I guess most sensible applications with units will implement only a selection.
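To illustrate the dimension idea without committing to any particular storage, here is a minimal Python sketch. It assumes each unit carries a factor to SI base units plus a tuple of exponents for (length, mass, time), and that compound units are built by multiplying, dividing, or raising units to a power. The class and constant names are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    factor: float   # multiply a value in this unit by factor to get SI base units
    dims: tuple     # exponents of (length, mass, time)

    def __mul__(self, other):
        return Unit(self.factor * other.factor,
                    tuple(a + b for a, b in zip(self.dims, other.dims)))

    def __truediv__(self, other):
        return Unit(self.factor / other.factor,
                    tuple(a - b for a, b in zip(self.dims, other.dims)))

    def __pow__(self, n):
        return Unit(self.factor ** n, tuple(a * n for a in self.dims))

def convert(value, src, dst):
    """Convert value from src to dst, refusing incommensurate dimensions."""
    if src.dims != dst.dims:
        raise ValueError("units do not have the same dimension")
    return value * src.factor / dst.factor

METER = Unit(1.0, (1, 0, 0))
KILOMETER = Unit(1000.0, (1, 0, 0))
FOOT = Unit(0.3048, (1, 0, 0))
SECOND = Unit(1.0, (0, 0, 1))
HOUR = Unit(3600.0, (0, 0, 1))
ACRE = Unit(43560 * FOOT.factor ** 2, (2, 0, 0))   # 43560 square feet

speed = convert(1.0, METER / SECOND, KILOMETER / HOUR)
volume = convert(1.0, ACRE * FOOT, METER ** 3)
```

Because compound units are composed rather than listed, "acre feet" or "meters/second" never need their own table rows; only the simple units do.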
I would start by choosing a standard unit for every quantity (e.g. meters for length, newtons for force, etc.) and then storing all the conversion factors to that unit in a table.
Then to go from days to hours, for example, you find the conversion factors for seconds per day and seconds per hour and divide them to find the answer.
For ambiguities, each unit could be associated with all the types of quantities it measures, and to determine which conversion to do, you would take the intersection of those two sets of types (and if you're left with zero or more than one, you would spit out an error).
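A short Python sketch of the intersection idea, assuming seconds as the standard unit for time and degrees for angle. The table layout and function names are made up; "minutes" appears twice because it can mean either a time or an angle (arcminutes).

```python
# (unit, quantity) -> factor to the standard unit of that quantity
TO_STANDARD = {
    ("days", "time"): 86400.0,
    ("hours", "time"): 3600.0,
    ("minutes", "time"): 60.0,
    ("minutes", "angle"): 1 / 60,      # arcminutes, in degrees
    ("degrees", "angle"): 1.0,
    ("revolutions", "angle"): 360.0,
}

def quantities(unit):
    """All quantity types a unit name can measure."""
    return {q for (u, q) in TO_STANDARD if u == unit}

def convert_factor(src, dst):
    """Pick the single quantity both units share, then divide the factors."""
    shared = quantities(src) & quantities(dst)
    if len(shared) != 1:
        raise ValueError(f"cannot resolve {src} -> {dst}: shared types {shared}")
    (quantity,) = shared
    return TO_STANDARD[(src, quantity)] / TO_STANDARD[(dst, quantity)]

days_to_hours = convert_factor("days", "hours")
minutes_to_hours = convert_factor("minutes", "hours")
minutes_to_revs = convert_factor("minutes", "revolutions")
```

The target unit disambiguates the source: "minutes" is treated as time next to "hours" and as an angle next to "revolutions".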
I assume that you want to hold the data about conversion in some kind of triples (fstUnit, sndUnit, multiplier).
For single unit conversions:
Use a hash function to map each unit to an index in O(1), and then put all multipliers in a matrix (you only need to store the upper-right triangle, because the mirrored entry is just the reciprocal).
For complex cases:
Example 1. m/s to km/h. You look up (m,km) in the matrix, then (s,h), then combine the results (dividing by the second factor, since seconds are in the denominator).
Example 2. m^3 to km^3. You check (m,km) and take it to the third power.
Of course, you should report an error when the types don't match, such as area and volume.
You can make a class for Units that takes the conversion factor and the exponents of all basic units (I'd suggest to use metric units for this, that makes your life easier). E.g. in Pseudo-Java:
public class Unit {
    public Unit(double factor, int meterExp, int secondExp, int kilogramExp ... [other base units]) {
        ...
    }
}
// you need the speed in km/h (1 m/s is 3.6 km/h):
Unit kmPerH = new Unit(1 / 3.6, 1, -1, 0, ...);
I would have a table with these fields:
conversionID
fromUnit
toUnit
multiplier
and however many rows you need to store all the conversions you want to support
If you want to support a multi-step process (degrees F to C), you'd need a one-to-many relationship with the units table, say called conversionStep, with fields like
conversionID
sequence
operator
value
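As a sketch of how such conversionStep rows might be applied, here is a small Python example. It assumes the operator column holds one of "+", "-", "*", "/" and that the steps are applied in sequence order; the in-memory dictionary stands in for the table and its contents are only illustrative.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

# (fromUnit, toUnit) -> ordered list of (operator, value) steps
CONVERSION_STEPS = {
    ("fahrenheit", "celsius"): [("-", 32.0), ("*", 5.0), ("/", 9.0)],
    ("celsius", "fahrenheit"): [("*", 9.0), ("/", 5.0), ("+", 32.0)],
}

def apply_conversion(value, from_unit, to_unit):
    """Run each step of the stored conversion, in sequence, on the value."""
    for op, operand in CONVERSION_STEPS[(from_unit, to_unit)]:
        value = OPS[op](value, operand)
    return value

boiling_c = apply_conversion(212.0, "fahrenheit", "celsius")
freezing_f = apply_conversion(0.0, "celsius", "fahrenheit")
```

A plain multiplier conversion is then just the special case of a single "*" step.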
If you want to store one set of conversions but support multi-step conversions, like storing
Feet|Inches|12
Inches|Centimeters|2.54
and supporting converting from Feet to Centimeters, I would store a conversion plan in another table, like
conversionPlanID
startUnits
endUnits
via
your row would look like
1 | feet | centimeters | inches