How would this hierarchy for knowledge representation be extended? - logic

I've tried making a hierarchy of all forms of knowledge - including physical objects, numbers, procedures etc. How could this be improved? How would a sentence such as "Jack is producing music sitting on a tree" fall into this chart? Jack would go into humans, tree into place , but where would music go?

I dont have any experience in this field but looking at it from purely linguistic and logical way, you can put "producing " totally into work. Or maybe branch out your "objects" into different kinds like "products"->{"tech","healthcare"...} etc and "art objects"->{"music","paintings"...}
But yeah you can argue that this is a little more specific just to fit in the sentence.

I do not have a direct answer to your question, but I think I know where you would get more information on this tops. If I am not mistaken, you are trying to develop an ontology.
In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships among those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain.
The most general ontology is developed at Cycorp, although the whole ontology is not freely available, a (large) subset of the onotlogy is available by Opencyc. They have an instalaltion prepared, which enables you to make queries to the ontology via a web-browser. Maybe you should explore this ontology and get some new ideas where to go next.

Related

What matching algorithm could I use?

I would need some help because I don't know what algorithm i could use for the following (I use python) :
Steve is 25 and he buys everyday orange juice
Maria is 23 and she likes to buy smoothies
Steve & Maria tastes are pretty much the same.
Juan is 16 and he only drinks sodas
Juan tastes are not the same as Steve and Maria.
====================================================
I would like to use a matching algorithm that will detect the users who have the same drink preference and a close age. To continue with the example, Steve and Maria would be matched together but not Juan. Which one should I use ?
I agree with #klutt that your task is pretty vague. There are two approaches that come to mind, but not knowing more details about your problem does limit the details I can provide in my answer that would help you. I am interpreting the question as if you are taking in raw text and might want to process more sentences that have very similar semantic and syntactical structure.
An algorithmic approach:
Assuming that your word choices are static in their semantic meaning (Maria is 23 ... Steve is 25), we can parse each sentence and identify tokens like is or and or same and essentially perform lexical analysis on the text... from here, you could continue thinking about how you would go about matching and so forth... but this is rather complicated...
Neural Network approach:
If you are taking in raw text in the form of sentences, it's a problem that's not straight forward to solve using a top-down algorithmic approach.
You could take an approach with neural networks that trains a model to solve your problem, but then again what you seem to be asking is quite complex since there are multiple "facts" within each sentence that are not semantically related. For example, your second sentence identifies that Maria is 23 but at the end of that sentence there is a comparison between Steve and Maria. And your first sentence only identifies Steve as 25.
Even if you chunk raw text into sentences, you would have to have a very fine tuned neural network architecture and a lot of training data to get remotely close to your goal.
Now, both of those solutions are very complex... but if you wanted to create an application that collects this data (via a form or prompt) and puts it into a structured format (like a json or xml object) to organize and store the data in memory (perhaps writing out to a database or file for persistent storage), that might be a good route to go down.
This can serve as a good lesson in how to think about data as well. It is one thing if you have a pool of thousands of sentences, just raw data that you need to organize for quantitative purposes (classic qualitative -> quantitative problems). It is another thing if you are going to be collecting this data. If you are going to be collecting data, having a program that collects and organizes names, ages, and drink preferences (and then organizes that data within certain data structures), then we can talk about matching algorithms.
I will also add here that if you do have structured data, Collaborative filtering (mentioned by Shridhar) is a great starting place.
Collaborative filtering best suits your needs.
In the newer, narrower sense, collaborative filtering is a method of
making automatic predictions (filtering) about the interests of a user
by collecting preferences or taste information from many users
(collaborating). The underlying assumption of the collaborative
filtering approach is that if a person A has the same opinion as a
person B on an issue, A is more likely to have B's opinion on a
different issue than that of a randomly chosen person. For example, a
collaborative filtering recommendation system for television tastes
could make predictions about which television show a user should like
given a partial list of that user's tastes (likes or dislikes).[3]
Note that these predictions are specific to the user, but use
information gleaned from many users. This differs from the simpler
approach of giving an average (non-specific) score for each item of
interest, for example based on its number of votes.

Word classification algorithm pro cons

As for college project I am required to build a software that, given some comments concerning a virtual construction site, detects its actual state (just started, in construction, terminated).
For example, given the comments:
"Happy to hear we can walk through the English Channel bridge"
"Yesterday I went to the newly built bridge to have a trip to France with my friends"
"They just finished the site and there are already cracks in the 5th miles. What a letdown!"
The system should detect that the "English Channel bridge" construction site has ended.
At the moment I'm trying to choose what word classification algorithm to use for this project. I searched online looking for the best classification algorithm to use. I've read about SVC but, since I'm not really an expert in this field, I am unsure about the compliance/goodness of SVC with my scenario.
What I'm trying to obtain is not the solution to my problem, but a list of available algorithms with their pros and cons.
You are formulating your problem incorrectly, making it difficult for people to give you a list of pros and cons.
The problem you are describing is not really a word classification problem since you are not classifying words. What you are trying to do is:
Named Entity Recognition for construction projects
Classify each construction Named Entity into 3 different types based on the mention context.
The algorithm is not the real issue. Most classification algorithms (linear regression, decision trees, SVM, etc...) will work.
The problem you actually have (but don't realize based on your question) is that you have no training data for either finding construction project named entities or classifying those entities once you have them into your 3 categories.
My suggestion would be that you use one of the freely available NER toolkits/libraries out there, add in dictionary features related to construction projects (words like bridge, tower, etc...) and see how well you can do at the first part of your task.
More important considerations are:
How much time/money do you have to get annotated data?
What sort of performance do you need?
What language/libraries are you willing toconsider (the least important question IMHO)
I'm sorry, I realize this is probably not the answer you want to hear but I suspect it is the answer you need. ;)

Natural Language Processing for Smart Homes

I'm writing up a Smart Home software for my bachelor's degree, that will only simulate the actual house, but I'm stuck at the NLP part of the project. The idea is to have the client listen to voice inputs (already done), transform it into text (done) and send it to the server, which does all the heavy lifting / decision making.
So all my inputs will be fairly short (like "please turn on the porch light"). Based on this, I want to take the decision on which object to act, and how to act. So I came up with a few things to do, in order to write up something somewhat efficient.
Get rid of unnecessary words (in the previous example "please" and "the" are words that don't change the meaning of what needs to be done; but if I say "turn off my lights", "my" does have a fairly important meaning).
Deal with synonyms ("turn on lights" should do the same as "enable lights" -- I know it's a stupid example). I'm guessing the only option is to have some kind of a dictionary (XML maybe), and just have a list of possible words for one particular object in the house.
Detecting the verb and subject. "turn on" is the verb, and "lights" is the subject. I need a good way to detect this.
General implementation. How are these things usually developed in terms of algorithms? I only managed to find one article about NLP in Smart Homes, which was very vague (and had bad English). Any links welcome.
I hope the question is unique enough (I've seen NLP questions on SO, none really helped), that it won't get closed.
If you don't have a lot of time to spend with the NLP problem, you may use the Wit API (http://wit.ai) which maps natural language sentences to JSON:
It's based on machine learning, so you need to provide examples of sentences + JSON output to configure it to your needs. It should be much more robust than grammar-based approaches, especially because the voice-to-speech engine might make mistakes that will break your grammar (but the machine learning module can still get the meaning of the sentence).
I am no way a pioneer in NLP(I love it though) but let me try my hand on this one. For your project I would suggest you to go through Stanford Parser
From your problem definition I guess you don't need anything other then verbs and nouns. SP generates POS(Part of speech tags) That you can use to prune the words that you don't require.
For this I can't think of any better option then what you have in mind right now.
For this again you can use grammatical dependency structure from SP and I am pretty much sure that it is good enough to tackle this problem.
This is where your research part lies. I guess you can find enough patterns using GD and POS tags to come up with an algorithm for your problem. I hardly doubt that any algorithm would be efficient enough to handle every set of input sentence(Structured+unstructured) but something that is more that 85% accurate should be good enough for you.
First, I would construct a list of all possible commands (not every possible way to say a command, just the actual function itself: "kitchen light on" and "turn on the light in the kitchen" are the same command) based on the actual functionality the smart house has available. I assume there is a discrete number of these in the order of no more than hundreds. Assign each some sort of identifier code.
Your job then becomes to map an input of:
a sentence of english text
location of speaker
time of day, day of week
any other input data
to an output of a confidence level (0.0 to 1.0) for each command.
The system will then execute the best match command if the confidence is over some tunable threshold (say over 0.70).
From here it becomes a machine learning application. There are a number of different approaches (and furthermore, approaches can be combined together by having them compete based on features of the input).
To start with I would work through the NLP book from Jurafsky/Manning from Stanford. It is a good survey of current NLP algorithms.
From there you will get some ideas about how the mapping can be machine learned. More importantly how natural language can be broken down into a mathematical structure for machine learning.
Once the text is semantically analyzed, the simplest ML algorithm to try first would be of the supervised ones. To generate training data have a normal GUI, speak your command, then press the corresponding command manually. This forms a single supervised training case. Make some large number of these. Set some aside for testing. It is also unskilled work so other people can help. You can then use these as your training set for your ML algorithm.

How the computer knows "Recommended for You"?

Recently, I found several web site have something like : "Recommended for You", for example youtube, or facebook, the web site can study my using behavior, and recommend some content for me... ...I would like to know how they analysis this information? Is there any Algorithm to do so? Thank you.
Amazon and Netflix (among others) use a technique called Collaborative filtering to suggest things you might like based on the likes/dislikes of others who have made purchases and selections similar to yours.
Is there any Algorithm to do so?
Yes
Yes. One fairly common one is to look at things you've selected in the past, find other people who've made those selections, then find the other selections most common among those other people, and guess that you're likely to be interested in those as well.
Yup there are lots of algorithms. Things such as k-nearest neighbor: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.
Here is a pretty good book on the subject that covers making these sorts of systems along with others: http://www.amazon.com/gp/product/0596529325?ie=UTF8&tag=ianburriscom-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596529325.
It's generally done by matching you with other users who have similar usage history / profile and then recommending other things that they've purhased/watched/whatever.
Searching for "recommendation algorithm" yields lots of papers. Most algorithms incorporate "machine learning" algorithms to determine groups of things (comedy movies, books on gardening, orchestral music, etc.). Your matching with those groups yields recommendations. Some companies use humans to classify things, too.
Such an algorithm is going to vary wildly from company to company. In many cases, it analyzes some combination of your search history, purchase history, physical location, and other factors. It probably will also compare purchases/searches amongst other people to find what those people have purchased/searched for, and recommend some of those products to you.
There are probably hundreds of these algorithms out there, but I doubt you can use any of them (that are actually good). Probably you are better off figuring it out yourself.
If you can categorize your contents (i.e. by tagging or content analysis), you can also categorize your users and their preferences.
For example: you have a video portal with 5 million videos .. 1 mio of them are tagged mostly red. If 80% of all videos watched by a user (who is defined by an IP, a persistent user account, ...) are tagged mostly red, you might want to recommend even more red videos to him. You might want to refine your recommendations by looking at his further actions: does he like your recommendations -- if so, why not give him even more, if not, try the second-best guess, maybe he's not looking for color, but for the background music ...
There's no absolute algorithm to do it, but all implementations will go into a similar direction. It's always basing on observing users, which scares me from time to time :-)
There's whole lot of algorithms tackling the issue: Wiki article. It's a Machine Learning domain problem. Computer's can be learned using two main techniques: classification and clustering. They require some datasets as input. If the dataset is informative (really holds some useful patterns) than those ML techniques can dig most of it.
Clustering could be best to use for this kind of problem. It's main usage is to find similarities among points in provided dataset. If the points are, e.g. your search history, they can be grouped together to form certain clusters. If Your search history closely relates to another, a hint can be given - picking links that are most similar to Your's.
The same comes with book recommendations - it's obvious what dataset they use: "Other people who bought this product also bought Product A, Product B,...". The key here is to match your profile to other's and use the most similar to recommend.
The computer retrieves information from the human brain with complex memory scan process, sorts it accordingly and outputs results based on what you have experienced in your life so far.

How do I data mine text?

Here's the problem. I have a bunch of large text files with paragraphs and paragraphs of written matter. Each para contains references to a few people (names), and documents a few topics (places, objects).
How do I data mine this pile to assemble some categorised library? ... in general, 2 things.
I don't know what I'm looking for, so I need a program to get the most used words/multiple words ("Jacob Smith" or "bluewater inn" or "arrow").
Then knowing the keywords, I need a program to help me search for related paras, then sort and refine results (manually by hand).
Your question is a tiny bit open-ended :)
Chances are, you will find modules for whatever analysis you want to do in the UIMA framework:
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA is made of many things
UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
You may also find Open Calais a useful API for text analysis; depending on how big your heap of documents is, it may be more or less appropriate.
If you want it quick and dirty -- create an inverted index that stores all locations of words (basically a big map of words to all file ids in which they occur, paragraphs in those files, lines in the paragraphs, etc). Also index tuples so that given a fileid and paragraph you can look up all the neighbors. This will do what you describe, but it takes quite a bit of tweaking to get it to pull up meaningful correlations (some keywords to start you off on your search: information retrieval, TF-IDF, Pearson correlation coefficient).
Looks like you're trying to create an index?
I think Learning Perl has information on finding the frequency of words in a text file, so that's not a particularly hard problem.
But do you really want to know that "the" or "a" is the most common word?
If you're looking for some kind of topical index, the words you actually care about are probably down the list a bit, intermixed with more words you don't care about.
You could start by getting rid of "stop words" at the front of the list to filter your results a bit, but nothing would beat associating keywords that actually reflect the topic of the paragraphs, and that requires context.
Anyway, I could be off base, but there you go. ;)
The problem with what you ask is that you don't know what you're looking for. If you had some sort of weighted list of terms that you cared about, then you'd be in good shape.
Semantically, the problem is twofold:
Generally the most-used words are the least relevant. Even if you use a stop-words file, a lot of chaff remains
Generally, the least-used words are the most relevant. For example, "bluewater inn" is probably infrequent.
Let's suppose that you had something that did what you ask, and produced a clean list of all the keywords that appear in your texts. There would be thousands of such keywords. Finding "bluewater inn" in a list of 1000s of terms is actually harder than finding it in the paragraph (assuming you don't know what you're looking for) because you can skim the texts and you'll find the paragraph that contains "bluewater inn" because of its context, but you can't find it in a list because the list has no context.
Why don't you talk more about your application and process and then perhaps we can help you better??
I think what you want to do is called "entity extraction". This Wikipedia article has a good overview and a list of apps, including open source ones. I used to work on one of the commercial tools in the list, but not in a programming capacity, so I can't help you there.
Ned Batchelder gave a great talk at DevDays Boston about Python.
He presented a spell-corrector written in Python that does pretty much exactly what you want.
You can find the slides and source code here:
http://nedbatchelder.com/text/devdays.html
I recommend that you have a look at R. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Computing: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example of an analysis of the R-devel
mailing list (https://stat.ethz.ch/pipermail/r-devel/) newsgroup postings from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
More generally, there are a large number of text mining packages on the Natural Language Processing view on CRAN.

Resources