Many small models vs. one big model - azure-language-understanding

I am currently implementing kind of a questionnaire with a chatbot and use LUIS to interpret the answers. This questionnaire is divided into segments of maybe 20 questions.
Now the following question came to mind: Which option is better?
I make one LUIS model per question. Since these questions can include implicit subquestions (i.e. "Do you want to pay all at once or in rates" could include a "How high should these rates be?") I end up with maybe 3-5 intents per question (including None).
I can make one model per segment. Let's assume that this is possible and fits in the 80 intents per model.
My first intuition was to think that the first alternative is better since this should be way more robust. When there are only 5 intents to choose from, then it may be easier to determine the right one. As far as I know, there is no restriction in how many models you can have (...is this correct?).
So here is my question for SO: What other benefits/drawbacks are there and is there maybe an option that is objectively the best?

You can have as many models as you want, there is no limit on this. But onto the rest of your question:
You intend to use LUIS to interpret every response? I'm curious as to the actual design of the questionnaire and why you need (or want) open ended responses and not multiple-choice questions. "Do you want to pay all at once or in rates" itself is a binary question. Branching off of this, users might respond with, "Yes I want to pay all at once", which could use LUIS. Or they could respond with, "rates" which could be one of two choices available to the user in a Prompt/FormFlow. "rates" is also much shorter than the first answer and thus a selection that would probably be typed more often than not.
Multiple-choice questions provide a standardized input which would reduce the amount of work you'd have to do in managing your data. It also would most likely reduce the amount of effort needed to maintain the models and questionnaire.
Objectively speaking, one model is more likely to be less work, but we can drill down a little further:
First option:
If your questionnaire segments include 20 questions and you have 2 segments, you have 40 individual models to maintain which is a nightmare.
Additionally, you might experience latency issues depending on your recognizer order, because you have to wait for a response from 40 endpoints. This said it IS possible to turn off recognizers so you might only need to wait for one recognizer. But then you need to manually turn on the next recognizer and turn off the previous one. You should also be aware that handling multiple "None" intents is a nightmare, in case you wish to leave every recognizer active.
I'm assuming that you'll want assistance in managing you models after you realize the pain of handling forty of them by yourself. You can add collaborators, but then you need to add them to multiple models as well. One day you'll (probably) have to remove them from all of the models they have collaborator status on.
The first option IS more robust but also involves a rather extreme amount of work hours. You're partially right in that fewer intents is helpful because of fewer possibilities for the model to predict. But the predictions of your models become more accurate with more utterances and labeling, so any bonus gained by having 5 intents per model is most likely to be rendered moot.
Second option:
Using one model per segment, as mentioned above is less work. It's less work to maintain, but what are some downsides? Until you train your model well enough, there may indeed be some unintended consequences due to false-positive predictions. That said, you could account for that in your questionnaire/bot/questionnaire-bot's code to specifically look for the expected intents for the question and then use the highest scoring intent from this subset if the highest scoring intent overall doesn't match to your question.
Another downfall is that if it's one model and a collaborator makes a catastrophic mistake, it affects the entire segment. With multiple models, the mistake would only affect the one question/model, so that's a plus.
Aside from not having to deal with multiple None-intent handling, you can quickly label utterances that should belong to the None intent. What you label as an intent in a singular model essentially makes it stand out more against the other intents inside of the model. If you have multiple models, an answer that triggers a specific intent in one model needs to trigger the None intent in your other models, otherwise, you'll end up with multiple high scoring intents (and the relevant/expected intents might not be the highest scoring).
End:
I recommend the second object, simply because it's less work. Also, I'm not sure of the questionnaire's goals, but as a general rule, I question the need of putting in AI where it's not needed. Here is a link that talks about factors that do not contribute to a bot's success (note that Natural Language is one of these factors).

Related

Exponential Random Graphs: nodematch() and nodefactor() in R

I have been reading on the web about the Exponential Random Graph modelling for network analysis and I am confused about one aspect. Do I need to add in the model the corresponding nodefactor() for each nodematch() term? Some people would say that: "Nodematch and absdiff are homophily terms – how similar are the two nodes – so in a sense it’s an interaction, say, of both the sender and the receiver being female. You therefore also want the overall effect of, say, being female, so your homophily term isn’t just picking up that women tend to send and receive more ties overall". Is this correct?
The thing is, I have seen many people running ERGM models without including both nodefactor() and nodematch() for the same term. I am just interested in understanding if it is ALWAYS NECESSARY to have both nodefactor() and nodematch() in the model. Or, in contrast, if I run a model containing only the nodematch() term (and this is statistically significant), are the results reliable (can I be sure that the results are not just picking up that certain categories tend to send and receive more ties overall?)?
Looking forward to receiving your answers.
Thank you!

MS LUIS: Number of Intents / Data Imbalance

I am seeing on the LUIS documentation page here that you absolutely recommend to treat Data Imbalance (e.g. the differing number of total unterances compared amongst various intents) as a first priority. We currently see a mean of 19 Utterances per Intent on our dashboard, so in my opinion I should optimize all Intents towards having about 20 Utterances each as an example.
Now my question: When I use active learning by adding Endpoint Utterances, Utterances will be added to the intent we see them fitting (Active Learning Documentation). How can I ensure, that the number of utterances per intent will always remain equal (e.g. around 20 in our example)? In my opinion naturally by attributing endpoint utterances to Intents, a Data Imbalance will be created again.
Thanks a lot!
Best,
Mark
After your initial model is satisfactory, there no longer needs to be equality between intents, active learning specifically tries to correct for cases that were unseen of before, so if other examples already cover all your cases, then you don’t need to actively correct it.

Algorithm to handle data aggregation from multiple error-prone sources

I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users (such as on last.fm), and may be incorrect. Other data sources are highly accurate, but may not contain every event. I can use attributes such as the event date, and the city/state to try to match listings from disparate sources. I'd like to be reasonably certain that the events are valid. It seems like it would be a good strategy to consume as many different sources as possible to validate listings on error-prone sources.
I'm not sure what the technical term for this is, as I'd like to research it further. Is it data mining? Are there any existing algorithms? I understand a solution will never be completely accurate.
Here is an approach that locates it within statistics - specifically, it uses a Hidden Markov Model (http://en.wikipedia.org/wiki/Hidden_Markov_model):
1) Use your matching process to produce a cleaned list of possible events. Consider each event to be marked "true" or "bogus", even though the markings are hidden from you. You might imagine that some source of events produces them, generating them as either "true" or "bogus" according to a probability which is an unknown parameter.
2) Associate unknown parameters with each source of listings. These give the probability that this source will report a true event produced by the source of events, and the probability that it will report a bogus event produced by the source.
3) Notice that if you could see the markings of "true" or "bogus" you could easily work out the probabilities for each source. Unfortunately, of course, you can't see these hidden markings.
4) Let's call these hidden markings "Latent Variables" because then you can use the http://en.wikipedia.org/wiki/Em_algorithm to hillclimb to promising solutions for this problem, from random starts.
5) You can obviously make the problem more complicated by dividing events up into classes, and giving sources of listing parameters which make them more likely to report some classes of events than others. This might be useful if you have sources that are extremely reliable for some sorts of events.
I believe the term you are looking for is Record Linkage -
the process of bringing together two or more records relating to the same entity(e.g., person, family, event, community, business, hospital, or geographical area)
This presentation (PDF) looks like a nice introduction to the field. One algorithm you might use is Fellegi-Holt - a statistical method for editing records.
One potential search term is "fuzzy logic".
I'd use a float or double to store a probability (0.0 = disproved ... 1.0 = proven) of some event details being correct. As you encounter sources, adjust the probabilities accordingly. There's a lot for you to consider though:
attempting to recognise when multiple sources have copied from each other and reduce their impact
giving more weight to more recent data or data that explicitly acknowledges the old data (e.g. given a 100% reliable site saying "concert X to be held on 4th August", and a unknown blog alleging "concert X moved from 4th August to 9th", you might keep the probability of there being such a concert at 100% but have a list with both dates and whatever probabilities you think appropriate...)
beware assuming things are discrete; contradictory information may reflect multiple similar events, dual billing, same-surnamed performers etc. - the more confident you are that the same things are referenced, the more the data can combined to reinforce or negate each other
you should be able to "backtest" your evolving logic by using data related to a set of concerts where you now have full knowledge of their actual staging or lack thereof; process data posted before various cut-off dates prior to the events to see how the predictions you derive reflect the actual outcomes, tweak and repeat (perhaps automatically)
It may be most practical to start scraping from the sites you have, then consider the logical implications of the types of information you're seeing. Which aspects of the problem need to be handled using fuzzy logic can then be decided. An evolutionary approach may mean reworking things, but may end up faster than getting bogged down in a nebulous design phase.
Data mining is about finding information from structured sources like a database, or a post where the fields are separated for you. There's some text mining in here when you have to parse the information out of free text. In either case, you could keep track of how many data sources agree on a show as a confidence measure. Either display the confidence measure or use it to decide if your data is good enough. There's lots to play with. Having a list of legitimate cities, venues and acts can help you decide if a string represents a legitimate entity. Your lists might even be in a database that lets you compare city and venue for consistency.

What's a good set of heuristics for threading tweets?

Everyone knows, if you want to thread emails you use Jamie
Zawinski's algorithm. But it's a new century, and there's a
new messaging service.
What's the best algorithm for threading status updates posted on
Twitter?
Things I'd definitely like it to cope with:
The easy part: using in_reply_to_status_id,
in_reply_to_user_id and in_reply_to_screen_name.
(Incidentally, finding proper documentation of these values
would be useful in itself! Such documentation isn't
obviously linked to from
here,
for example.)
Good heuristics for inferring a "reply" relationship from
messages that mention a user with the # convention but aren't
explicitly in reply to a particular message. These
"mentions" are provided in the "entities" element of
statuses now
if you request that. These heuristics might take into
account (a) the time between two status updates, (b) whether
there are subsquent replies between the two users, etc.
(Replies that consist of an old-style retweet with an
additional comment, as mentioned by user85509
below
are just an instance of this style of reply.)
Conversations that take place between more than two users.
Working with a set of tweets given to the algorithm, or all
tweets on Twitter.
... but perhaps you can think of more.
Since there's only been one answer, and the bounty deadline is approaching soon, I thought I should add a baseline answer so the bounty isn't automatically awarded to an answer that doesn't add much beyond what's in the question.
The obvious first step is to take your original set of tweets and follow all in_reply_to_status_id links to build many directed acyclic graphs. These relationships you can be nearly 100% sure about. (You should follow the links even through tweets that aren't in the original set, adding those to the set of status updates that you're considering.)
Beyond that easy step, one has to do deal with the "mentions". Unlike in email threading, there's nothing helpful like a subject line that one can match on - this is inevitably going to be very error prone. The approach I would take is to create a feature vector for every possible relationship between status IDs that might be represented by mentions in that tweet, and then train a classifier to guess the best option, including a "no reply" option.
To work out the "every possible relationship" bit, start by considering every status update that mentions one or more other users and doesn't contain an in_reply_to_status_id. Suppose an example of one of these tweets is: 1
#a #b no it isn't lol RT #c Yes, absolutely. /cc #stephenfry
... you would create a feature vector for the relationship between this update and every update with an earlier date in the timelines of #a, #b, #c, and #stephenfry for the last week (say) and one between that update and a special "no reply" update. Then you have to create a feature vector - you can add to this whatever you would like, but I would at least suggest adding:
The time that elapsed between the two updates - presumably replies are more likely to be to recent updates.
The proportion of the way through the tweet in terms of words that a mention occurs. e.g. if this is the first word, this would be a score of 0 and that's probably more likely to indicate a reply than mentions later in the update.
The number of followers of the mentioned user - celebrities are presumably more likely to be spam-mentioned.
The length of the longest common substring between the updates, which might indicate direct quoting.
Is the mention preceded by "/cc" or other signifiers that indicate that this isn't directly a reply to that person?
The following / followed ratio for the author of the original update.
etc.
etc.
The more of these one can come up with the better, since the classifier will only use those that turn out to be useful. I'd suggest trying a random forest classifier, which is conveniently implemented in Weka.
Next one needs a training set. This can be small at first - just enough to get a service that identifies conversations up-and-running. To this basic service, one would have to add a nice interface for correcting mismatched or falsely linked updates, so that users can correct them. Using this data one can build a bigger training set and a more accurate classifier.
1 ... which might be typical of the level of discourse on Twitter ;)
On Twitter, people often write "RT" in front of the message they are replying to.

algorithms to evaluate user responses

I'm working on a web application which will be used for classifying photos of automobiles. The users will be presented with photos of various vehicles, and will be asked to answer a series of questions about what they see. The results will be recorded to a database, averaged, and displayed.
I'm looking for algorithms to help me identify users which frequently don't vote with the group, indicating that they're probably either not paying attention to the photos, or that they're lying about what they see. I then want to exclude these users, and recalculate the results, such that I can say, with a known amount of confidence, that this particular photo shows a vehicle that is this and that.
This question goes out to all you computer science guys, where to find such algorithms or to give myself the theoretical background to design such algorithms. I'm assuming I'm going to have to learn some probability and statics, maybe some data mining. Some book recommendations would be great. Thanks!
P.S. These are multiple choice questions.
All of these are good suggestions. Thank you! I wish there was a way on stack overflow to select multiple correct answers so more of you could be acknowledged for your contributions!!
Read The Elements of Statistical Learning, it is a great compendium on data mining.
You can be interested especially in unsupervised algorithms, for example clustering. Assuming that most people do not lie, the biggest cluster is right and the rest is wrong. Mark people accordingly, then apply some bayesian statistics and you'll be done.
Of course, most data mining technologies are pretty experimentative, so don't count on that they will be always right... or even in most cases.
I believe what you described is solved using outlier/anomaly detection.
A number of techniques exist:
statistical-based methods
distance-based methods
model-based methods
I suggest you take a look at these slides from the excellent book Introduction to Data Mining
If you know what answers you are expecting why do you ask people to vote? By excluding some values you basically turn the vote in something that you like. Automobiles make different impression to different individuals. If 100 ppl loved a car then when someone comes and says that he/she doesn't like it, you exclude the vote?
But anyway, considering that you still want to do this, first of all you will need a large set o data from "trusted" voters. This will give you an idea of "good" answer and from this point you can choose the exclude threshold.
Without an initial set of data you cannot apply any algorithm because you will get false results. Consider just one vote of 100 from on a scale from 0 to 100. The second vote is "1" The you will exclude this vote because is too far away from the average.
I think a pretty simple algorithm could accomplish this for you. You could try and get fancier by calculating the standard deviations and such, but I wouldn't bother.
Here's a simple approach that should be sufficient:
For each of your users, calculate the number of questions they answered and the number of times they selected the most popular answer for the question. The users which have the lowest ratio of picking the popular answer versus total answers you can guess are providing bogus data.
You probably would not want to throw out the data from users where they've only answered a small number of questions because they likely have just disagreed on a few versus putting in bogus data.
What kind of questions are they (Yes/No, or 1 to 10?).
You may be able to get away with not discarding anything by using a mean instead of an average. With averages if there are extreme outliers in the response it could affect the average, but if you use median you may get a better answer. So for example if you had 5 answers, order them and pick the middle one.
I think what you are saying is that you are concerned that certain people are "outliers", and they are adding noise to your data, making the categorizations less reliable. So, if you have a Chevy Camaro, and most people say it is either a pony car, a muscle car, or a sports car, but you have some goofball who says it's a family sedan, you would want to minimize the impact of his vote.
One thing you could do is provide a Stack Overflow-like reputation score for users:
The more a user is "in agreement" with other users, the better his or her score would be. For a given user (User X), this could be determined by a simple calculation of what percentage of users who responded to a question chose the same category as User X, then averaging this value over all questions answered.
You may want to to multiply this value by the total number of question answered to encourage people to answer as many questions as possible. (Note: if you choose to do this, it would be equivalent to just summing the percentage agreement scores rather than averaging them.)
You could present the final reputation score to users, making sure to explain that they will be rewarded for how well their responses agree with those of other users. This will encourage people to answer more questions but also to take care in their answers.
Finally, you could calculate a certainty score for a given categorization by adding up the total reputation score of all people who chose a given category.
Some of these ideas may need some refinement, especially since I don't know your exact situation. Certainly, if people can see what other people chose before they vote, it would be way too easy to game the system.
If you were to collect votes like "on a scale from 1 to 10, how would you rate this car", you could probably use simple average and standard deviation: the smaller the standard deviation, the more unanimous the general consensus is among your voters, and you can flag users who are e.g. 3 standard devs from the average.
For multiple choice, you need to be more careful. Simply discarding all but the most-voted option will do nothing but disgruntle the voters. You need to establish a measure of how significant the winner is w.r.t. the other options, e.g. flag users who voted for options with less than 1/3 of the winning options count.
Note that I wrote "flag users", not discard votes. If you discard votes, you can't tell how confident you are about the result ("91% voted this to be a Ford Mustang"). If a user has more than a certain percentage of his votes flagged - well, that's up to you.
Your trickiest problem, however, will probably be to collect sufficient votes. Depending on how easy the multiple choice problem is, you probably need several times the number of options as votes, per photo. Otherwise the statistics are meaningless.

Resources