Are there any chat filters that works depending on the context? I'm talking about the use of new technologies like Artificial Intelligence and Natural Language Processing to determine for example if a word was rude or not, depending on the context.
The simplest way is to just use a regex for the handful of most offensive words.
There are services offered online that will allow you to query with a word to see if it's profane. Those are good options if you want to be really sure.
Unfortunately, there's no sure bet for any of these. One man's profanity is another's common talk (see santorum. very very vulgar, but most aren't offended.) And each group has their own fowl language. ballox isn't so bad in America, but it's fairly bad in Britain.
You could make a clbuttic mistake. Even if you eliminate everything, I can still write a horribly offensive story using the kindest language.
A short black list or one of the services is the way to go depending on the level of filtering you want.
This looks interesting http://pottymouthfilter.com, It is unfortunately commercial product but at least someone is working on something along those lines.
Related
I'm currently working on an internationalisation project for a large web application - initially we're just implementing French but more languages will follow in time. One of the issues we've come across is how to display adjectives.
Let's take "Active" as an example. When we received translations back from the company we're using, they returned "Actif(ve)", as English "Active" translates to masculine "Actif" or feminine "Active". We're unsure of how to display this, and wondered if there are any well established conventions in the web development world.
As far as I see it there are three possible scenarios:
We know at development time which noun a given adjective is referring to. In this case we can determine and use the correct gender.
We're referring to a user, either directly ("you") or in the third person. Short of making every user have a gender, I don't see a better approach than displaying both, i.e. "Actif(ve)"
We are displaying the adjective in isolation, not knowing which noun it's referring to. For example in a table of data, some rows might be dealing with a masculine entity, some feminine.
Scenarios 2 and 3 seem to be the toughest ones. Does anyone have any experience handling these issues? Any tips would be appreciated!
This is complex, because we cannot imagine all the cases, and there is risk to go in "opinion based" answer, so I keep it short and generic.
Usually I prefer to give context in translation (for translator), e.g. providing template: _("active {user_name}" (so also the ordering will be correct if languages want different ordering).
Then you may need to change code and template into _("active {first_name_feminine}") and _("active {first_name_masculine}") (and possibly more for duals, trials, plurals, collectives, honorific, etc.). Note: check that the translator will not mangle the {} and the string inside. Usually you need specific export/import scripts. Or I add a note inside the string, and I quickly translate into English removing the note to the translator). Also this can be automated (be creative on using special Unicode characters which should not be used in normal text, to delimit such text).
But if you cannot know the gender, the Actif(ve) may be the polite version used in such language. You need a native speaker test, and changes back and forth.
I got a financial application and I wish to add to it the ability to get user command or input in textbox and then take the right action. for example, wish the user to write "show the revenue in the last 10 days" and it'll show the revenue to him/her - the point is that I wish it to really understand the meaning of the question, so the previus statement will bring the same results as "do I got any revenue in the last 10 days" or something like that - BI (something like the Wolfram|Alpha engine).
I wonder if there's any opensource library or algorithm books or whatever that I can use to learn the subject. Regards to opensource libraries - I don't mind which language it'll be written in.
I've read about this subject and saw many engines and services (OpenNLP, Apache UIMA, CoreNLP etc.) but did not figure out if they're right for my needs.
Any answer or suggestion is welcome.
Many thanks!
The field you're talking about is usually called "natural language processing". It's hard, and an active field of research. There are various libraries which you could consider based on your preferred programming language and use case:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
I've used NLTK a little bit. This field is seriously difficult to get right, so you might want to try to restrict your application to some small set of verbs and nouns such that people are using a controlled vocabulary in the first instance, and then try to extend it beyond that.
I have users that have authenticated with a social media site. Now based on their last X (let's say 200) posts, I want to map how much that content matches up with a finite list of keywords.
What would be the best way to do this to capture associated words/concepts (maybe that's too difficult) or just get a score of how much, say, my tweet history maps to 'Walrus' or 'banana'?
Would a naive Bayes work here to separate into 'matches' and 'no match'?
In Python I would say NLTK can easily do it. In Ruby maybe gem called lda-ruby will help you. Whole LDA concept is well explained here - look at Sarah Palin's email for example. There's even the example of an app (not entirely in Ruby, but still) which did that -> github.com/echen/sarah-palin-lda
Or maybe I just say stupid things and that can't help you at all. I'm not an expert ;)
A simple bayes would work in this case, it is highly used to detect if emails are spam or not so for a simple keyword matching it should work pretty well.
For this problem you could also apply a recommendation system where you look for the top recommended keyword for a user (or for a post).
There are a ton of ways for doing this. I would recommend you to read Programming Collective Intelligence. It is explained using python but since you know ruby there should be not problem to understand the code.
I have tags on my website, and I input them one by one when I create a blog post. I love gmail's new feature, that ask you if you want to include X in a mail, if you type Y's name and that you often include both of them in the same messages.
I'd like to do something similar on my website, but I don't know how to represent the tags "related-ness" in an object or database ... thoughts ?
It all boils down to create associations between certain characteristics of your posts and certain tags, and then - when you press the "publish" button - to analyse the new post and propose all tags matched with your post characteristics.
This can be done in several ways from a "totally hard-coded" association to some sort of "learning AI"... and everything in-between.
Hard-coded solutions
This are the simplest algorithms to implement. You should first decide what characteristics of your post are relevant for tagging (e.g.: it's length if you tag them "short" or "long", the presence of photos or videos if you tag them "multimedia-content", etc...). The most obvious is however to focus on which words are used in posts. For example you could build a mapping like this:
tag_hint_words = {'code-development' : ['programming',
'language', 'python', 'function',
'object', 'method'],
'family' : ['Theresa', 'kids',
'uncle Ben', 'holidays']}
Then you would check your post for the presence of the words in the list (the code between [ and ] ) and propose the tag (the word before :) as a possible candidate.
A common approach is to give "scores", or in other word to put a number that indicates the probability a given tag is the right one. For example: if your post would contain the sentence...
After months of programming, we finally left for the summer holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!
...despite the presence of the word "programming" the program should indicate family as the most likely tag to use, as there are many more words hinting.
Learning AI's
One of the obvious limitations of the above method is that - say one day you pick up java beside python - you would probably need to change your code and include words like "java" or "oracle" too. The same applies if you create new tags.
To circumvent this limitation (and have some fun!!) you could try to implement a learning algorithm. Learning algorithms are those who refine their outcome the more you use them (so they indeed... learn!). Some algorithm requires initial training (many spam filters and voice recognition programs need this initial "primer"). Some don't.
I am absolutely no expert on the subject, but two common AI's are: the Naive Bayes Classifier and some flavour of Neural network.
Although the WP pages might look scary, they are surprisingly easy to implement (at least in Python). Here's the recording of a lecture at PyCon 2009 on the subject "Easy AI with Python". I found it very informative and even somehow inspiring! :)
HTH!
You should have a look at this post :
Any suggestions for a db schema for storing related keywords?
If you're looking for a schema for storing related tags it will help.
Relevancy searches where multiple agents play a part are usually done using Collaborative filtering. You might want to give that a look see.
Look up Clustering (Machine Learning algorithm). Don't be intimidated by math, it's a pretty straightforward algorithm. Check out Machine Learning for Hackers for simpler explanations of many Machine Learning algorithms and methods.
Are there any naming conventions or standards for Url parameters to be followed. I generally use camel casing like userId or itemNumber. As I am about to start off a new project, I was searching whether there is anything for this, and could not find anything. I am not looking at this from a perspective of language or framework but more as a general web standard.
I recommend reading Cool URI's Don't Change by Tim Berners-Lee for an insight into this question. If you're using parameters in your URI, it might be better to rewrite them to reflect what the data actually means.
So instead of having the following:
/index.jsp?isbn=1234567890
/author-details.jsp?isbn=1234567890
/related.jsp?isbn=1234567890
You'd have
/isbn/1234567890/index
/isbn/1234567890/author-details
/isbn/1234567890/related
It creates a more obvious data structure, and means that if you change the platform architecture, your URI's don't change. Without the above structure,
/index.jsp?isbn=1234567890
becomes
/index.aspx?isbn=1234567890
which means all the links on your site are now broken.
In general, you should only use query strings when the user could reasonably expect the data they're retrieving to be generated, e.g. with a search. If you're using a query string to retrieve an unchanging resource from a database, then use URL-rewriting.
There are no standards that I'm aware of. Just be mindful of IE's URL length limit of 2,083 characters.
Standard for URI are defined by RFC2396.
Anything after the standardized portion of the URL is left to you.
You probably only want to follow a particular convention on your parameters based on the framework you use.
Most of the time you wouldn't even really care because these are not under your control, but when they are, you probably want to at least be consistent and try to generate user-friendly bits:
that are short,
if they are meant to be directly accessible by users, they should be easy to remember,
case-insensitive (may be hard depending on the server OS).
follow some SEO guidelines and best practices, they may help you a lot.
I would say that cleanliness and user-friendliness are laudable goals to strive for when presenting URLs.
StackOverflow does a fairly good job of it.
I use lowercase. Depending on the technology you use, QS is either threated as case-sensitive (eg. PHP) or not (eg. ASP). Using lowercase avoids possible confusion.
Like the other answers I've not heard about any conventions.
The only "standard" I would adhere to is to use the more search engine friendly practice of using a URL rewriter.
There are no standards that I know of, and case shouldn't matter.
However within your application (website), you should stick to your own standards. For your own sanity if nothing else.