AMPL: what's a good way to specify equality constraints for a large list of pairs of variable-size sets?

I'm working on a problem that involves reconciling data that represents estimates of the same system under two different classification hierarchies. I want to enforce the requirement that equivalent classes or groups of classes have the same sum.
For example, say Classification A divides industries into: Agriculture (sheep/cattle), Agriculture (non-sheep/cattle), Mining, Manufacturing (textiles), Manufacturing (non-textiles), ...
Meanwhile, Classification B has a different breakdown: Agriculture, Mining (iron ore), Mining (non-iron-ore), Manufacturing (chemical), Manufacturing (non-chemical), ...
In this case, any total for A_Agric_SheepCattle + A_Agric_NonSheepCattle should match the equivalent total for B_Agric; A_Mining should match B_MiningIronOre + B_Mining_NonIronOre; and A_MFG_Textiles+A_MFG_NonTextiles should match B_MFG_Chemical+B_MFG_NonChemical.
For bonus complication, one category may be involved in multiple equivalencies, e.g. B_Mining_IronOre might be involved in an equivalency with both A_Mining and A_Mining_Metallic.
I will be working with multi-dimensional tables, with this sort of concordance applied to more than one dimension - e.g. I might be compiling data on Industry x Product, so each equivalency will be used in multiple constraints; hence I need an efficient way to define them once and invoke repeatedly, instead of just setting a direct constraint "A_Agric_SheepCattle + A_Agric_NonSheepCattle = B_Agric".
The most natural way to represent this sort of concordance would seem to be as a list of pairs of sets. The catch is that the set sizes will vary - sometimes we have a 1:1 equivalence, sometimes it's "these 5 categories equate to those 7 categories", etc.
I found this related question which offers two answers for dealing with variable-sized sets. One is to define all set members in a single ordered set with indices, then define the starting index for each set within that. However, this seems unwieldy for my problem; both classifications are likely to be long, so I'd need to be hopping between two loooong lists of industries and two looong lists of indices to see a single equivalency. This seems like it would be a nuisance to check, and hard to modify (since any change to membership for one of the early sets changes the index numbers for all following sets).
The other is to define pairs of long fixed-length sets, and then pad each set to the required length with null members.
This would be a much better option for my purposes since it lets me eyeball a single line and see the equivalence that it represents. But it would require a LOT of padding; most of the equivalence groups will be small but a few might be quite large, and everything has to be padded to the size of the largest expected length.
Is there a better approach?
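To make the "list of pairs of sets" idea concrete, here is a purely illustrative sketch (plain Python rather than AMPL, and just one possible layout) of the concordance for the example categories above; group sizes vary freely and a category may appear in more than one equivalency:

    # Illustrative only (Python, not AMPL): the concordance as a list of pairs of groups,
    # where each pair says "these Classification A categories sum to these Classification B categories".
    concordance = [
        (("A_Agric_SheepCattle", "A_Agric_NonSheepCattle"), ("B_Agric",)),
        (("A_Mining",),                                     ("B_Mining_IronOre", "B_Mining_NonIronOre")),
        (("A_MFG_Textiles", "A_MFG_NonTextiles"),           ("B_MFG_Chemical", "B_MFG_NonChemical")),
        # a category may take part in more than one equivalency:
        (("A_Mining_Metallic",),                            ("B_Mining_IronOre",)),
    ]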

Related

How to decide to convert to categorical variable or keep it numeric?

This might be a basic or trivial question, but I would like to ask it to clear my doubt once and for all.
Take the example of Passenger Class in the famous Titanic data. Functionally it is indeed categorical data, so it would make perfect sense to convert it to a categorical variable. Algorithms, as I understand them, tend to see a pattern specific to that class. But at the same time, if you see it as a numeric variable, it might also denote a range for a decision tree - say, passengers between first class and second class.
It looks like both are correct, and both will affect the machine learning algorithm's outputs in different ways.
Which one is appropriate, and is there an extensive discussion about it anywhere? Should we use such ambiguous variables as numeric and also include a copy as a categorical variable, which might prove to be a technique to uncover more patterns?
I suppose it's up to you whether you'd rather interpret a continuous PassengerClass variable as "for every one-unit increase in PassengerClass, the passenger's likelihood of survival goes up/down X%," versus a categorical (factor) PassengerClass as, "the likelihoods of survival for groups 2 and 3 (for example, leaving 1st-class passengers as the base group) are X% and Y% higher, respectively, than the base group, holding all else constant."
I think about variables like PassengerClass almost as "treatment groups." Yes, I suppose you could interpret it as continuous, but I think it makes more sense to consider the unique effects of each class like "people who were given the drug versus those who weren't" - you can very easily compare the impacts of being in a higher class (e.g. 2 or 3) to being in the most common class, 1, which again would be left out.
The problem with mapping categorical notions to numerical is that some algorithms (e.g. neural networks) will interpret the value itself as having a meaning, i.e. you would get different results if you assign values 1,2,3 to passenger classes than, for example 0,1,2 or 3,2,1. The correspondence between the passenger classes and numbers is purely conventional and doesn't necessarily convey any additional meaning.
One could argue that the lower the number, the "better" the class is; however, it's still hard to interpret it as "the first class is twice as good as second class", unless you define some measure of "goodness" that makes the relation between the numbers "1" and "2" sensible.
In this example, you have categorical data that is ordinal - meaning you can rank the categories (from best accommodations to worst, for example) but they're still categories. Regardless of how you label them, there's no actual information about the relative distances among your categories. You can put them in a table, but not (correctly) on a number line. In cases like this, it's generally best to treat your categorical data as independent categories.
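For what it's worth, the two treatments are easy to put side by side; here is a minimal sketch, assuming pandas is available and a column named Pclass (the names and values are illustrative):

    import pandas as pd

    df = pd.DataFrame({"Pclass": [1, 3, 2, 1, 3]})

    # Numeric treatment: the model sees an ordered quantity (1 < 2 < 3) with implied distances.
    numeric = df["Pclass"].astype(float)

    # Categorical treatment: one indicator column per class, no implied ordering or distances.
    categorical = pd.get_dummies(df["Pclass"], prefix="Pclass")
    print(categorical)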

Can we merge rankings from somewhat-similar data sets to produce a global rank?

Another way of asking this is: can we use relative rankings from separate data sets to produce a global rank?
Say I have a variety of data sets with their own rankings based upon the criterion of cuteness for baby animals: 1) Kittens, 2) Puppies, 3) Sloths, and 4) Elephants. I used pairwise comparisons (i.e., showing people two random pictures of the animals and asking them to select the cuter one) to obtain these rankings. I also have the full set of comparisons within each data set (i.e., all puppies were compared with each other in the puppy data set).
I'm now trying to merge the data sets together to produce a global ranking of the cutest animal.
The main issue of relative ranking is that the cutest animal in one set may not necessarily be the cutest in the other set. For example, let's say that baby elephants are considered to be less than attractive, and so the least cute kitten will always beat the cutest elephant. How should I get around this problem?
I am thinking of doing a few cross comparisons across data sets (Kittens vs Elephants, Puppies vs Kittens, etc.) to create some sort of base importance, but this may become problematic as I increase the number and variety of animals.
I was also thinking of looking further into filling in sparse matrices, but I think this is only applicable to one data set as opposed to comparing across multiple data sets?
You can achieve your task using a rating system, such as the well-known Elo or Glicko systems, or our own rankade. A rating system lets you build a ranking starting from pairwise comparisons, and
you don't need to do all comparisons, nor have all animals involved in the same number of comparisons,
you don't need to do comparisons inside a specific data set only (let all animals 'play' against all other animals; then if you need a ranking for one data set, just use the global ranking ignoring animals from the others).
Using rankade (here's a comparison with the aforementioned rating systems and Microsoft's TrueSkill) you can record outcomes for matchups of more than two items as well, which Elo and Glicko don't support. Ranking many items at once is messy and difficult for people, but a small multi-way comparison (e.g. 3-5 animals) should be suitable and useful in your work.
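To illustrate the mechanics, a minimal Elo-style update for a single pairwise "which is cuter" comparison could look like the sketch below; the K-factor, the starting rating of 1500 and the animal names are arbitrary choices, not taken from any particular system:

    def elo_update(rating_winner, rating_loser, k=32.0):
        """Return updated (winner, loser) ratings after one pairwise comparison."""
        expected_win = 1.0 / (1.0 + 10 ** ((rating_loser - rating_winner) / 400.0))
        rating_winner += k * (1.0 - expected_win)
        rating_loser -= k * (1.0 - expected_win)
        return rating_winner, rating_loser

    # Every animal starts at the same rating and simply "plays" whenever it appears in a comparison.
    ratings = {"kitten_07": 1500.0, "elephant_02": 1500.0}
    ratings["kitten_07"], ratings["elephant_02"] = elo_update(ratings["kitten_07"], ratings["elephant_02"])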

Comparing two large datasets using a MapReduce programming model

Let's say I have two fairly large data sets - the first is called "Base" and it contains 200 million tab-delimited rows, and the second is called "MatchSet", which has 10 million tab-delimited rows of similar data.
Let's say I then also have an arbitrary function called Match(row1, row2) and Match() essentially contains some heuristics for looking at row1 (from MatchSet) and comparing it to row2 (from Base) and determining if they are similar in some way.
Let's say the rules implemented in Match() are custom and complex rules, i.e. not a simple string match, involving some proprietary methods. Let's say for now that Match(row1,row2) is written in pseudo-code, so implementation in another language is not a problem (though it's in C++ today).
In a linear model, i.e. a program running on one giant processor, we would read each line from MatchSet and each line from Base, compare one to the other using Match(), and write out our match stats. For example, we might capture: X records from MatchSet are strong matches, Y records from MatchSet are weak matches, Z records from MatchSet do not match. We would also write the strong/weak/no-match values to separate files for inspection. In other words, a nested loop of sorts:
for each row1 in MatchSet
{
    for each row2 in Base
    {
        var type = Match(row1, row2);
        switch (type)
        {
            // do something based on type
        }
    }
}
I've started considering Hadoop streaming as a method for running these comparisons as a batch job in a short amount of time. However, I'm having a bit of a hard time getting my head around the map-reduce paradigm for this type of problem.
I understand pretty clearly at this point how to take a single input from hadoop, crunch the data using a mapping function and then emit the results to reduce. However, the "nested-loop" approach of comparing two sets of records is messing with me a bit.
The closest I'm coming to a solution is that I would basically still have to do a 10 million record compare in parallel across the 200 million records, so 200 million / n nodes * 10 million iterations per node. Is that the most efficient way to do this?
From your description, it seems to me that your problem can be arbitrarily complex and could be a victim of the curse of dimensionality.
Imagine for example that your rows represent n-dimensional vectors, and that your matching function is "strong", "weak" or "no match" based on the Euclidean distance between a Base vector and a MatchSet vector. There are great techniques to solve these problems with a trade-off between speed, memory and the quality of the approximate answers. Critically, these techniques typically come with known bounds on time and space, and on the probability of finding a point within some distance of a given MatchSet prototype, all depending on some parameters of the algorithm.
Rather than ramble about it here, please consider reading the following:
Locality Sensitive Hashing
The first few hits on Google Scholar when you search for "locality sensitive hashing map reduce". In particular, I remember reading [Das, Abhinandan S., et al. "Google news personalization: scalable online collaborative filtering." Proceedings of the 16th international conference on World Wide Web. ACM, 2007] with interest.
Now, on the other hand, if you can devise a scheme that is directly amenable to some form of hashing, then you can easily produce a key for each record with such a hash (or even a small number of possible hash keys, one of which would match the query "Base" data), and the problem becomes a simple large(-ish) scale join. (I say "largish" because joining 200M rows with 10M rows is quite small if the problem is indeed a join.) As an example, consider the way CDDB computes the 32-bit ID for any music CD (the CDDB1 calculation). Sometimes a given title may yield slightly different IDs (i.e. different CDs of the same title, or even the same CD read several times), but by and large there is a small set of distinct IDs for that title. At the cost of a small replication of the MatchSet, in that case you can get very fast search results.
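As a rough sketch of what that key-based join looks like (outside Hadoop, just to show the shape of the computation), you could bucket the smaller MatchSet by its key and only run Match() within matching buckets. The block_key() and match() functions below are placeholders for whatever scheme and heuristics fit your data:

    from collections import defaultdict

    def block_key(row):
        # Placeholder: derive a coarse key from a record (analogous to a CDDB-style disc ID)
        # so that rows which could possibly match share at least one key.
        return row.split("\t")[0]

    def match(row1, row2):
        # Placeholder for the proprietary Match() heuristics.
        return "strong" if row1 == row2 else "no"

    match_set_rows = ["a\tfoo", "b\tbar"]           # stand-ins for the 10M MatchSet lines
    base_rows = ["a\tfoo", "c\tbaz"]                # stand-ins for the 200M Base lines

    # Bucket the smaller MatchSet by key, then probe with Base rows; Match() only runs
    # within a bucket instead of across the full 200M x 10M cross product.
    buckets = defaultdict(list)
    for row in match_set_rows:
        buckets[block_key(row)].append(row)

    for base_row in base_rows:
        for candidate in buckets.get(block_key(base_row), []):
            result = match(candidate, base_row)
            # ... tally strong/weak/no-match counts per MatchSet row here ...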
Check Section 3.5 - Relational Joins in the book 'Data-Intensive Text Processing with MapReduce'. I haven't gone into detail, but it might help you.
This is an old question, but your proposed solution is correct assuming that your single stream job does 200M * 10M Match() computations. By doing N batches of (200M / N) * 10M computations, you've achieved a factor of N speedup. By doing the computations in the map phase and then thresholding and steering the results to Strong/Weak/No Match reducers, you can gather the results for output to separate files.
If additional optimizations could be utilized, they'd likely apply to both the single-stream and parallel versions. Examples include blocking so that you need to do fewer than 200M * 10M computations, or precomputing constant portions of the algorithm for the 10M-row match set.
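For example, a Hadoop-streaming mapper along these lines could handle one shard of Base per task, assuming the 10M-row MatchSet is shipped to every node (e.g. via the distributed cache) under the hypothetical name matchset.tsv and that Match() has been ported or wrapped for Python; reducers keyed on the emitted match type then simply collect the strong/weak results into separate outputs. This is only a sketch of the idea:

    #!/usr/bin/env python
    # Hypothetical streaming mapper: reads one shard of Base from stdin, compares every
    # line against the full MatchSet, and emits "match_type<TAB>matchset_row" pairs.
    import sys

    def match(row1, row2):
        # Placeholder for the real Match() heuristics (currently implemented in C++).
        return "no"

    # matchset.tsv is a hypothetical local copy of the 10M-row MatchSet on each node.
    with open("matchset.tsv") as f:
        match_set = [line.rstrip("\n") for line in f]

    for base_row in sys.stdin:
        base_row = base_row.rstrip("\n")
        for ms_row in match_set:
            kind = match(ms_row, base_row)          # "strong", "weak" or "no"
            if kind != "no":                        # skip non-matches to keep the output small
                print(kind + "\t" + ms_row)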

What is the best way to analyse a large dataset with similar records?

Currently I am looking for a way to develop an algorithm which is supposed to analyse a large dataset (about 600M records). The records have the parameters "calling party", "called party", "call duration", and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records - people mostly talk to their friends and don't dial random numbers, but occasionally a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
create an array of numbers to indicate which records (row numbers) have already been scanned.
start scanning from the first line, and for the first line's ("calling party", "called party") combination check for the same combinations in the database
sum the call durations and divide the result by the sum of all call durations
add the numbers of the summed lines into the array created at the beginning
check the array to see whether the next record number has already been summed
if it has already been summed then skip the record, else perform step 2
I would appreciate if anyone of you suggested any improvement of the logic described above.
p.s. the edges are directed therefore the (calling party, called party) is not equal to (called party, calling party)
Although this fact is not programming-related, I would like to emphasize that, due to the law and respect for user privacy, all the information that could possibly reveal a user's identity was hashed before the analysis.
As always with large datasets the more information you have about the distribution of values in them the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
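In code, that hash table is essentially a dict keyed by the (calling, called) pair; a minimal Python sketch, assuming the rows are already parsed into (calling, called, duration) tuples, would be:

    from collections import defaultdict

    # Stand-in for the parsed 600M rows: (calling_party, called_party, call_duration)
    records = [("a", "b", 12.0), ("a", "b", 3.5), ("b", "a", 7.0)]

    # (calling_party, called_party) -> [number_of_calls, total_duration]
    stats = defaultdict(lambda: [0, 0.0])
    for calling, called, duration in records:
        key = (calling, called)                     # edges are directed, so the order matters
        stats[key][0] += 1
        stats[key][1] += duration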
Since there are 600M records, the dataset seems large enough to warrant a database (and not too large to require a distributed database). So, you could simply load this into a DB (MySQL, SQL Server, Oracle, etc.) and run the following query:
select calling_party, called_party, sum(call_duration), avg(call_duration), min(call_duration), max(call_duration), count(*) from call_log group by calling_party, called_party order by 7 desc
That would be a start.
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find records for (calling_party, called_party) would be to use a hash function and find the matching record in the corresponding bucket.
Although it may be tempting to create a two-dimensional array for (calling_party, called_party), that would be a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus only once or twice - don't worry too much about the performance, just get it done, e.g. as Amrinder Arora says by using simple, existing tooling you happen to know.
You really want more information about the distribution as High Performance Mark says. For starters, it'd be nice to know the count of unique phone numbers, the count of unique phone number pairs, and the mean, variance and maximum of the count of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or in identifying individual clusters? Do you care more about following the links forward (determining who X frequently called) or following the links backward (determining who X was frequently called by)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2D? Should it be easy to identify indirect links - e.g. X is near {A, B, C}, all of whom are near Y, so X is sort of near Y?
If you want fast and frequently adapted results, then be aware that a dense representation with good memory & temporal locality can easily make a huge difference in performance. In particular, that can easily outweigh a factor ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch those if you can avoid it at all; they are likely to be a factor 10000 slower - or more, the more complex the queries are you want to perform on the result.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately, and then do a N-way merge sort. That's memory efficient and can be easily parallelized.
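A sketch of that single pass in Python (on data small enough to sort in memory; for the full 600M rows you would substitute the chunked sort plus a merge, e.g. heapq.merge, as described above):

    from itertools import groupby

    # Stand-in rows: (calling_party, called_party, call_duration)
    records = [("a", "b", 12.0), ("b", "a", 7.0), ("a", "b", 3.5)]

    # Sort by (calling, called): all occurrences of a directed pair become consecutive.
    records.sort(key=lambda r: (r[0], r[1]))

    total_duration = sum(r[2] for r in records)
    for pair, group in groupby(records, key=lambda r: (r[0], r[1])):
        weight = sum(r[2] for r in group) / total_duration
        print(pair, weight)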

Classifying Text Based on Groups of Keywords?

I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an algorithm that would give me a score ranking which of the categories each requirement is likely to fall into. The results would be used as a starting point to further categorize the requirements.
As an example, suppose I have the requirement:
The system shall apply deposits to a customer's specified account.
And categories/keywords:
Customer Transactions: deposits, deposit, customer, account, accounts
Balance Accounts: account, accounts, debits, credits
Other Category: foo, bar
I would want the algorithm to score the requirement highest in category 1, lower in category 2, and not at all in category 3. The scoring mechanism is mostly irrelevant to me, but needs to convey how much more likely category 1 applies than category 2.
I'm new to NLP, so I'm kind of at a loss. I've been reading Natural Language Processing in Python and was hoping to apply some of the concepts, but haven't seen anything that quite fits. I don't think a simple frequency distribution would work, since the text I'm processing is so small (a single sentence.)
You might want to look at the category of "similarity measures" or "distance measures" (which is different, in data mining lingo, from "classification").
Basically, a similarity measure is a mathematical way to:
Take two sets of data (in your case, words)
Do some computation/equation/algorithm
The result being that you have some number which tells you how "similar" that data is.
With similarity measures, this number is a number between 0 and 1, where "0" means "nothing matches at all" and "1" means "identical"
So you can actually think of your sentence as a vector - and each word in your sentence represents an element of that vector. Likewise for each category's list of keywords.
And then you can do something very simple: take the "cosine similarity" or "Jaccard index" (depending on how you structure your data.)
What both of these metrics do is they take both vectors (your input sentence, and your "keyword" list) and give you a number. If you do this across all of your categories, you can rank those numbers in order to see which match has the greatest similarity coefficient.
As an example:
From your question:
Customer Transactions: deposits,
deposit, customer, account, accounts
So you could construct a vector with 5 elements: (1, 1, 1, 1, 1). This means that, for the "Customer Transactions" keyword list, you have 5 words, and (this will sound obvious but) each of those words is present in your search string. Bear with me.
So now you take your sentence:
The system shall apply deposits to a
customer's specified account.
This has 3 words from the "Customer Transactions" set: {deposits, account, customer}
(actually, this illustrates another nuance: you actually have "customer's". Is this equivalent to "customer"?)
The vector for your sentence might be (1, 0, 1, 1, 0)
The 1's in this vector are in the same position as the 1's in the first vector - because those words are the same.
So we could say: how many times do these vectors differ? Let's compare:
(1,1,1,1,1)
(1,0,1,1,0)
Hm. They have the same "bit" 3 times - in the 1st, 3rd, and 4th positions. They only differ by 2 bits. So let's say that when we compare these two vectors, we have a "distance" of 2. Congrats, we just computed the Hamming distance! The lower your Hamming distance, the more "similar" the data.
(The difference between a "similarity" measure and a "distance" measure is that the former is normalized - it gives you a value between 0 and 1. A distance is just any number, so it only gives you a relative value.)
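The worked example above translates to only a few lines of code; here is a small sketch computing the Hamming distance on the two vectors and, for comparison, the Jaccard index on the underlying word sets:

    keywords = ["deposits", "deposit", "customer", "account", "accounts"]
    # Sentence words, lowercased, with "customer's" treated as "customer" (the nuance noted above).
    sentence = {"the", "system", "shall", "apply", "deposits", "to", "a",
                "customer", "specified", "account"}

    cat_vec = [1] * len(keywords)                             # (1, 1, 1, 1, 1)
    sent_vec = [1 if w in sentence else 0 for w in keywords]  # (1, 0, 1, 1, 0)

    hamming = sum(a != b for a, b in zip(cat_vec, sent_vec))  # 2 positions differ
    jaccard = len(sentence & set(keywords)) / len(sentence | set(keywords))  # similarity in [0, 1]
    print(sent_vec, hamming, jaccard)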
Anyway, this might not be the best way to do natural language processing, but for your purposes it is the simplest and might actually work pretty well for your application, or at least as a starting point.
(PS: "classification" - as you have in your title - would be answering the question "If you take my sentence, which category is it most likely to fall into?" Which is a bit different than saying "how much more similar is my sentence to category 1 than category 2?" which seems to be what you're after.)
good luck!
The main characteristics of the problem are:
Externally defined categorization criteria (keyword list)
Items to be classified (lines from the requirements document) are made of a relatively small number of attribute values, over effectively a single dimension: "keyword".
As defined, no feedback/calibration (although it may be appropriate to suggest some of that)
These characteristics bring both good and bad news: the implementation should be relatively straightforward, but a consistent level of accuracy of the categorization process may be hard to achieve. Also, the small amounts of the various quantities (number of possible categories, max/average number of words in an item, etc.) should give us room to select solutions that may be CPU- and/or space-intensive, if need be.
Yet, even with this license to "go fancy", I suggest starting with (and staying close to) a simple algorithm, and expanding on this basis with a few additions and considerations, while remaining vigilant of the ever-present danger called overfitting.
Basic algorithm (conceptual, i.e. no focus on performance tricks at this time)
Parameters =
CatKWs = an array/hash of lists of strings. The list contains the possible
keywords, for a given category.
usage: CatKWs[CustTx] = ('deposits', 'deposit', 'customer' ...)
NbCats = integer number of pre-defined categories
Variables:
CatAccu = an array/hash of numeric values with one entry per each of the
possible categories. usage: CatAccu[3] = 4 (if array) or
CatAccu['CustTx'] += 1 (hash)
TotalKwOccurences = counts the total number of keyword matches (counted
multiple times when a word is found in several pre-defined categories)
Pseudo code: (for categorizing one input item)
1. for x in 1 to NbCats
CatAccu[x] = 0 // reset the accumulators
2. for each word W in Item
for each x in 1 to NbCats
if W found in CatKWs[x]
TotalKwOccurences++
CatAccu[x]++
3. for each x in 1 to NbCats
CatAccu[x] = CatAccu[x] / TotalKwOccurences // calculate rating
4. Sort CatAccu by value
5. Return the ordered list of (CategoryID, rating)
for all corresponding CatAccu[x] values above a given threshold.
Simple but plausible: we favor the categories that have the most matches, but we divide by the overall number of matches, as a way of lessening the confidence rating when many words were found. Note that this division does not affect the relative ranking of the category selection for a given item, but it may be significant when comparing the ratings of different items.
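A direct, unoptimized Python rendering of that pseudo-code, just to pin down the mechanics (the names follow the pseudo-code; nothing here is tuned):

    def categorize(item_words, cat_kws):
        """Score one input item; cat_kws maps category -> set of keywords."""
        cat_accu = {cat: 0 for cat in cat_kws}          # step 1: reset the accumulators
        total_kw_occurrences = 0
        for word in item_words:                         # step 2: count matches per category
            for cat, keywords in cat_kws.items():
                if word in keywords:
                    total_kw_occurrences += 1
                    cat_accu[cat] += 1
        if total_kw_occurrences:                        # step 3: turn counts into ratings
            for cat in cat_accu:
                cat_accu[cat] /= total_kw_occurrences
        # steps 4-5: ordered (category, rating) list, keeping only non-zero ratings
        return sorted(((c, r) for c, r in cat_accu.items() if r > 0), key=lambda x: -x[1])

    # e.g. categorize(["deposits", "customer", "account"],
    #                 {"CustTx": {"deposits", "deposit", "customer", "account", "accounts"},
    #                  "BalAcct": {"account", "accounts", "debits", "credits"}})
    # -> [('CustTx', 0.75), ('BalAcct', 0.25)]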
Now, several simple improvements come to mind. (I'd seriously consider the first two, and give thought to the other ones; deciding on each of these is very much tied to the scope of the project, the statistical profile of the data to be categorized, and other factors...)
We should normalize the keywords read from the input items and/or match them in a fashion that is tolerant of misspellings. Since we have so few words to work with, we need to ensure we do not lose a significant one because of a silly typo.
We should give more importance to words found less frequently in CatKWs. For example, the word 'Account' should count less than the word 'foo' or 'credit'.
We could (but maybe that won't be useful or even helpful) give more weight to the ratings of items that have fewer [non-noise] words.
We could also include consideration based on digrams (two consecutive words), since with natural languages (and requirements documents are not quite natural :-) ) word proximity is often a stronger indicator than the words themselves.
we could add a tiny bit of importance to the category assigned to the preceding (or even following, in a look-ahead logic) item. Items will likely come in related series and we can benefit from this regularity.
Also, aside from the calculation of the rating per-se, we should also consider:
some metrics that would be used to rate the algorithm outcome itself (tbd)
some logic to collect the list of words associated with an assigned category and to eventually run statistics on these. This may allow the identification of words representative of a category that were not initially listed in CatKWs.
The question of metrics should be considered early, but this would also require a reference set of input items: a "training set" of sorts, even though we are working off a pre-defined category-keyword dictionary (typically, training sets are used to determine this very list of category-keywords, along with a weight factor). Of course, such a reference/training set should be both statistically significant and statistically representative [of the whole set].
To summarize: stick to simple approaches; anyway, the context doesn't leave room to be very fancy. Consider introducing a way of measuring the efficiency of particular algorithms (or of particular parameters within a given algorithm), but beware that such metrics may be flawed and prompt you to specialize the solution for a given set to the detriment of the other items (overfitting).
I was also facing the same issue of creating a classifier based only on keywords. I had a class-keywords mapper file which contained the class variable and the list of keywords occurring in each class. I came up with the following algorithm, and it is working really well.
# predictor algorithm: score every document against each class's keyword list
for docs in readContent:
    catAccum = [0] * len(docKywrdmppr)               # one accumulator per class
    docWords = removeStopWords(docs)                 # assumed helper: tokenize and drop stop words
    for i in range(len(docKywrdmppr)):
        classKeywords = removeStopWords(docKywrdmppr['Keywords'][i].casefold())
        for word in docWords:
            if word.casefold() in classKeywords:
                catAccum[i] += 1
    print(catAccum)
    ind = catAccum.index(max(catAccum))              # class with the most keyword hits
    predictedDoc.append(docKywrdmppr['Document Type'][ind])
