parsing data and POS with treetop vs. stanford nlp - ruby

I'm trying to parse event (concerts, movies, etc. etc.) data in Ruby and can't decide on what tool to use.
I initially thought the Stanford Parser was the way to go, but then heard of Treetop.
I'm struggling with both. Getting the Stanford Parser to work with Ruby on Windows has taken more than two days of searching, with no end of errors just getting it installed.
Treetop installed without a problem, but the documentation is very limited, and from what I can gather, Treetop is better at dealing with grammar structure than with the actual content. Maybe I'm just not fully understanding Treetop's capabilities.
One of the nice things (I think) is that I have a large database/corpus(?) of band and movie names, and only a fairly limited set of details that I'm looking to retrieve.
For instance one listing is
The Tragically Hip with Guest Hey Rosetta!, Friday Jul 15th, 7:30pm, Deer Lake Park
Another listing is
07/08/11 - Tacoma Dome, New Kids on the Block & Backstreet Boys w/ Matthew Morrison, 7:30pm, Tacoma, WA
From each listing I'm trying to grab a specific group of details: who/what, date, time, city, and venue.
Since I already have a dataset of band names, and a list of city names should be fairly easy to get, it should be 'fairly' easy to pick out the other details. I'm just not sure which tool I should dedicate my time to, or whether there is a better way to do this.
Any suggestions?

No, Treetop is meant for parsing more structured languages (like computer languages). For natural language processing (NLP), you'd be better off using the Stanford Parser or something like it. Have a look at this blog entry about NLP in combination with Ruby:
http://mendicantbug.com/2009/09/13/nlp-resources-for-ruby/
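Since the question mentions already having a list of band names, a lighter-weight option for listings this regular might be dictionary lookup plus a couple of regular expressions. The following is only a rough sketch under that assumption: the constants KNOWN_ACTS and KNOWN_VENUES and the method extract_event are hypothetical names, the lists stand in for the asker's own database, and the date patterns cover only the two formats shown in the sample listings.

```ruby
# Hypothetical lookup lists; in practice these would come from the asker's database.
KNOWN_ACTS = ["The Tragically Hip", "Hey Rosetta!", "New Kids on the Block",
              "Backstreet Boys", "Matthew Morrison"].freeze
KNOWN_VENUES = ["Deer Lake Park", "Tacoma Dome"].freeze

# Times such as "7:30pm".
TIME_RE = /\b\d{1,2}:\d{2}\s?(?:am|pm)\b/i
# Dates such as "07/08/11".
NUMERIC_DATE_RE = /\b\d{2}\/\d{2}\/\d{2}\b/
# Dates such as "Jul 15th".
NAMED_DATE_RE = /\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}(?:st|nd|rd|th)?\b/i
DATE_RE = Regexp.union(NUMERIC_DATE_RE, NAMED_DATE_RE)

# Returns a hash of the details that could be found in one listing.
def extract_event(listing)
  {
    acts:  KNOWN_ACTS.select { |name| listing.include?(name) },
    venue: KNOWN_VENUES.find { |name| listing.include?(name) },
    date:  listing[DATE_RE],
    time:  listing[TIME_RE]
  }
end

p extract_event("07/08/11 - Tacoma Dome, New Kids on the Block & Backstreet Boys w/ Matthew Morrison, 7:30pm, Tacoma, WA")
```

Plain substring matching like this will miss misspelled or abbreviated names, so it is a starting point rather than a replacement for a real parser.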

Related

I want to edit a well-formatted Excel file with Ruby

I have a well-formatted Excel file with a lot of macros and styling in it that I want to keep.
Then I have some information I want to enter in the file.
And I want to do it with Ruby.
I've tried roo and spreadsheet, but they don't seem able to actually edit the file; they just create a new one, losing all the formatting in the process.
It feels like it should be simple to just edit the cells I want and save the file again, but obviously it's more complex than I originally thought (or I'm completely blind).
Any help is appreciated.
I'm learning Ruby at the moment, so that's why I would prefer a solution in Ruby.
If you know of languages better suited to this, feel free to point me in the right direction and I'll check them out.
Thanks in advance
Speaking from experience, there is no Ruby gem that handles Excel files with all the bells, whistles, macros and styling. It is a pity, because Excel is arguably the finest of Microsoft's products. In my experience, the spreadsheet library can import legacy data from Excel, LibreOffice Calc etc. (I'm not sure about Gnumeric).
As for your problem of getting data from Ruby to Excel, I suggest that you first save the Ruby output as a separate file (spreadsheet, CSV, text...) and then teach Excel to import it (e.g. using macros).
Another possibility is to abandon Excel for data processing tasks (and possibly keep it for data presentation tasks). Excel is great for presentation and simple data processing, but very bad for complex algorithms.
I wrote the gems yzz and y_nelson, which I intended as a Ruby replacement for spreadsheets. Yzz provides Ted Nelson's ZZ structures in Ruby (a ZZ structure is an improved version of the spreadsheet data structure), and y_nelson mixes it with Petri nets (because Petri nets are an improved version of Excel cell functions). Mathematically speaking, a spreadsheet is a hybrid of a multidimensional orthogonal grid of data cells and a Petri net execution engine. With y_nelson, I hope to bring dearly missed Excel functionality into Ruby, while at the same time moving one step towards better abstraction.

text mining/analyse user commands/questions algorithm or library

I have a financial application and I want to add the ability to take a user command or question from a text box and then perform the right action. For example, I want the user to be able to write "show the revenue in the last 10 days" and have the application show them the revenue. The point is that I want it to really understand the meaning of the question, so that the previous statement brings the same results as "do I have any revenue in the last 10 days" or something like that - BI (something like the Wolfram|Alpha engine).
I wonder if there are any open-source libraries, algorithm books or anything else that I can use to learn the subject. As for open-source libraries, I don't mind which language they are written in.
I've read about this subject and seen many engines and services (OpenNLP, Apache UIMA, CoreNLP etc.) but could not figure out whether they're right for my needs.
Any answer or suggestion is welcome.
Many thanks!
The field you're talking about is usually called "natural language processing". It's hard, and an active field of research. There are various libraries which you could consider based on your preferred programming language and use case:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
I've used NLTK a little bit. This field is seriously difficult to get right, so you might want to try to restrict your application to some small set of verbs and nouns such that people are using a controlled vocabulary in the first instance, and then try to extend it beyond that.

Matching users with objects based on keywords and activity in Ruby

I have users that have authenticated with a social media site. Now based on their last X (let's say 200) posts, I want to map how much that content matches up with a finite list of keywords.
What would be the best way to do this to capture associated words/concepts (maybe that's too difficult) or just get a score of how much, say, my tweet history maps to 'Walrus' or 'banana'?
Would a naive Bayes work here to separate into 'matches' and 'no match'?
In Python, I would say NLTK can easily do it. In Ruby, a gem called lda-ruby may help you. The whole LDA concept is well explained here - look at Sarah Palin's email for example. There's even an example of an app (not entirely in Ruby, but still) which did that -> github.com/echen/sarah-palin-lda
Or maybe I'm just saying stupid things that can't help you at all. I'm not an expert ;)
A simple naive Bayes classifier would work in this case. It is widely used to detect whether emails are spam, so for simple keyword matching it should work pretty well.
For this problem you could also apply a recommendation system, where you look for the top recommended keyword for a user (or for a post).
There are a ton of ways to do this. I would recommend reading Programming Collective Intelligence. It is explained using Python, but since you know Ruby there should be no problem understanding the code.
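To make the naive Bayes suggestion concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing in plain Ruby. The class name NaiveBayes, its methods, the labels :match and :no_match, and the training sentences are all made up for illustration; a real setup would train on many labelled posts per keyword.

```ruby
# A tiny multinomial naive Bayes text classifier (illustrative only).
class NaiveBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # label => word => count
    @doc_counts  = Hash.new(0)                             # label => number of documents
    @vocabulary  = {}                                      # set of all words seen in training
  end

  # Record one labelled example.
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return the label with the highest log-probability for the text.
  def classify(text)
    total_docs = @doc_counts.values.sum
    words = tokenize(text)
    scores = @doc_counts.keys.map do |label|
      total_words = @word_counts[label].values.sum
      log_prob = Math.log(@doc_counts[label].to_f / total_docs) # class prior
      words.each do |word|
        # Add-one smoothing so unseen words do not zero out the score.
        log_prob += Math.log((@word_counts[label][word] + 1.0) /
                             (total_words + @vocabulary.size))
      end
      [label, log_prob]
    end
    scores.max_by { |_, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

nb = NaiveBayes.new
nb.train(:match, "walrus tusks on the beach")
nb.train(:match, "saw a walrus at the zoo")
nb.train(:no_match, "bananas for breakfast")
nb.train(:no_match, "traffic was terrible today")
p nb.classify("walrus photos")
```

To get a score rather than a yes/no answer, one option is to classify each of the last X posts and report the fraction labelled :match.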

What is a good approach for extracting keywords from user-submitted text?

I'm building a site that allows users to make sense of a debate by graphically representing arguments for and against a particular issue. (Wrangl)
I'd like to categorise these debates so they are more easily found and connected. I don't want to irritate the person creating the debate by asking them to add tags and categories before they see any benefit, so I'm looking at a way of automatically extracting keywords.
What's a good approach for taking the debate's title and description (and possibly the content of the arguments themselves once there are some) and pulling out, say, ten strong keywords that could be used as metadata to connect similar debates together, or even as the content of the "meta" keywords tag in the head of the HTML page where the debate is viewable? E.g. Datamapper vs ActiveRecord
The site is coded in Ruby with Sinatra, using DataMapper for data storage. I'm ideally looking for something which will work on Heroku (I don't have a way of writing files to disk dynamically), and I'd consider a web service, an API or ideally a Ruby gem.
Maybe you can use TextAnalyzer.
I understand that you want an easy way of achieving this. I recently dived into the world of NLP (natural language processing) and text mining, and it's a daunting field, most of which went far above my head.
I did manage to code some functionality that resembles what you're looking for, though I did it in PHP. If you want it tailored to your project (Wrangl), I would suggest doing it yourself,
using the Porter stemming algorithm, for which I'm sure there is Ruby code.
Ruby Porter stemmer
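A do-it-yourself version could start as simply as counting the most frequent non-trivial words. The sketch below is one possible shape for that, not the code from this answer: STOPWORDS is a deliberately tiny hypothetical list, extract_keywords is a made-up method name, and the stemming step is left out (a Porter stemmer would slot in where the comment indicates). It uses no files on disk, so it should not conflict with the Heroku constraint.

```ruby
# A short, hypothetical stopword list; a real one would be much longer.
STOPWORDS = %w[a an and are as at be by for from in is it of on or that the to vs with].freeze

# Return up to `limit` keywords, most frequent first (ties broken alphabetically).
def extract_keywords(text, limit = 10)
  words = text.downcase.scan(/[a-z]+/)
  words = words.reject { |w| w.length < 3 || STOPWORDS.include?(w) }
  # A stemmer (e.g. Porter) could be applied to each word here before counting.
  counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

p extract_keywords("DataMapper vs ActiveRecord: which ORM is better? DataMapper fans say DataMapper.", 2)
```

Raw frequency favours long descriptions, so weighting against how common a word is across all debates (TF-IDF) would be a natural next step.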
You can try the salsaAPI to automatically extract keywords and categorize the debates!

What's needed for NLP?

Assuming that I know nothing about anything and that I'm starting in programming TODAY, what would you say I need to learn in order to start working with Natural Language Processing?
I've been struggling with some string parsing methods, but so far it is just annoying me and making me write ugly code. I'm looking for some fresh ideas on how to create a Remember The Milk-like API that parses the user's input, in order to provide a fast data entry form that is based not on fields but on simple one-line phrases.
EDIT: RTM is a to-do list system. To enter a task you don't need to fill in each field (task name, due date, location, etc.). You can simply type in a phrase like "Dentist appointment monday at 2PM in WhateverPlace" and it will parse it and fill in all the fields for you.
I don't have any kind of technical constraints since it's going to be a personal project but I'm more familiar with .NET world. Actually, I'm not sure this is a matter of language but if it's necessary I'm more than willing to learn a new language to do it.
My project is related to personal finances so the phrases are more like "Spent 10USD on Coffee last night with my girlfriend" and it would fill location, amount of $$$, tags and other stuff.
Thanks a lot for any kind of directions that you might give me!
This does not appear to require full NLP. Simple pattern-based information extraction will probably suffice. The basic idea is to tokenize the text, then recognize/classify certain keywords, and finally recognize patterns/phrases.
In your example, tokenizing gives you "Dentist", "appointment", "monday", "at", "2PM", "in", "WhateverPlace". Your tool will recognize that "monday" is a day of the week, "2PM" is a time, etc. Finally, you can find patterns like [at] [TIME] and [in] [Place] and use those to fill in the fields.
A framework like GATE may help, but even that may be a larger hammer than you really need.
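The tokenize / classify / match-patterns idea above can be sketched in a few lines. This is only an illustration for the RTM-style example phrase: classify_token and parse_task are hypothetical names, the token classes are limited to days, times and a few prepositions, and everything after "in" is assumed to be the place.

```ruby
DAYS = %w[monday tuesday wednesday thursday friday saturday sunday].freeze
TIME = /\A\d{1,2}(?::\d{2})?(?:am|pm)\z/i
PREPOSITIONS = %w[at in on].freeze

# Step 2: classify a single token.
def classify_token(token)
  if DAYS.include?(token.downcase)
    :day
  elsif token.match?(TIME)
    :time
  elsif PREPOSITIONS.include?(token.downcase)
    :prep
  else
    :word
  end
end

# Steps 1 and 3: tokenize, then look for the patterns [at] [TIME] and [in] [Place].
def parse_task(phrase)
  tagged = phrase.split.map { |t| [t, classify_token(t)] }
  fields = { name: [], day: nil, time: nil, place: nil }
  i = 0
  while i < tagged.length
    token, tag = tagged[i]
    following = tagged[i + 1]
    if tag == :day
      fields[:day] = token
    elsif tag == :prep && following && following[1] == :time
      fields[:time] = following[0]   # pattern: [at] [TIME]
      i += 1                         # skip the time token
    elsif tag == :prep && token.downcase == "in" && following
      # pattern: [in] [Place]; assume the rest of the phrase is the place
      fields[:place] = tagged.drop(i + 1).map(&:first).join(" ")
      break
    else
      fields[:name] << token
    end
    i += 1
  end
  fields[:name] = fields[:name].join(" ")
  fields
end

p parse_task("Dentist appointment monday at 2PM in WhateverPlace")
```

Anything that matches no pattern simply ends up in the task name, which keeps the parser forgiving when the input is phrased unexpectedly.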
Have a look at NLTK; it's a good resource for beginner programmers interested in NLP.
http://www.nltk.org/
It is written in Python, which is one of the easier programming languages.
Now that I understand your problem, here is my solution:
You can develop a kind of restricted vocabulary, in which all amounts must end with a $ sign and any time must be in the form 00:00 and/or end with AM/PM. For detecting items, you can use a list of objects from an ontology such as Open Cyc. Open Cyc can provide you with a list of objects such as beer, coffee, bread and milk, which will help you detect objects in the short phrase. Still, it would be a very fuzzy approach.
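A restricted vocabulary like this maps fairly directly onto regular expressions plus a lookup list. The sketch below assumes the conventions just described (amounts end with $, times look like 00:00 and/or end with AM/PM); KNOWN_ITEMS is a hypothetical four-word stand-in for a list exported from an ontology, and parse_expense is a made-up method name.

```ruby
# Amounts must end with a $ sign, e.g. "10$" or "4.50$".
AMOUNT_RE = /(\d+(?:\.\d{1,2})?)\$/
# Times are either "8:30", "8:30 PM" or "8 PM".
TIME_OF_DAY_RE = /\b\d{1,2}:\d{2}(?:\s?(?:AM|PM))?\b|\b\d{1,2}\s?(?:AM|PM)\b/i
# Hypothetical stand-in for a list of objects taken from an ontology.
KNOWN_ITEMS = %w[beer coffee bread milk].freeze

def parse_expense(phrase)
  amount = phrase[AMOUNT_RE, 1]
  {
    amount: amount && amount.to_f,
    time:   phrase[TIME_OF_DAY_RE],
    items:  phrase.downcase.scan(/[a-z]+/) & KNOWN_ITEMS
  }
end

p parse_expense("Spent 10$ on Coffee at 8:30 PM")
```

Input that ignores the conventions (for example "10USD" or "last night") is simply not recognised here, which is the trade-off of restricting the vocabulary.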

Resources