Mallet CRF SimpleTagger phrases/multi-word expressions - POS tagger

I am a newbie to Mallet. I am trying to use Mallet's SimpleTagger/CRF and experimenting with phrases. I tried looking up the documentation on the Mallet site and also went through the user archives; nothing helped.
I tried training Mallet for simple tagging and it works reasonably well. Here is what my data looks like
(please note there is a newline between training instances to indicate they are separate sequences).
Sample training data:
where STOPWORD
is STOPWORD
chicago CITY
<---Newline---->
Sunnyvale CITY
<---Newline---->
Chicago CITY
<---Newline---->
Washington CITY
<---Newline---->
What STOPWORD
is STOPWORD
Sunnyvale CITY
time ASK
<---Newline---->
new STOPWORD
<---Newline---->
place STOPWORD
The problem I have is when city names are multi-word, say:
new york CITY
Please note that in the above training data "new" is a STOPWORD.
Questions
For SimpleTagger, is the above representation fine? If not, how do I represent phrases?
How do I represent the data so that SimpleTagger/CRF can use the previous 'n' words to arrive at a tag, i.e. chunk my input?

As far as I know, the format you have used for multi-word expressions is not correct.
According to here, the format of the input is feature1 feature2 feature3 ....
So in your case, "new" is feature1, "york" is feature2, etc.
I suggest using New_York so that your multi-word expression is treated as one token.
Meanwhile, note that you don't have to include the words themselves in the input data. If you do, they are treated as the first feature. So if the word text (or word lemma) is not an interesting feature to you, leave it out of your input data.
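For instance, with the joined-token approach, the problematic example from the question would look like this (here the token itself is the only feature, but you can append more features before the label):
where STOPWORD
is STOPWORD
New_York CITY
Training then works the same way as before; per the Mallet documentation, the SimpleTagger invocation looks roughly like this (paths and file names are placeholders):
java -cp "$MALLET_HOME/class:$MALLET_HOME/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file mymodel train.txt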

Related

Use case for NER libraries

Hi, I am exploring NER libraries to parse through some financial documents: company filings, prospectuses, etc.
These documents have information like the company name, some keywords, and a value associated with them.
I would like to tag and extract these as 3 different entities.
So say, for instance, I have a phrase or sentence that reads:
ABC corp submitted the following on 1/1/2017 ...We are offering $300,000,000 aggregate principal amount of Floating Rate Notes due 2014 (the “2014 Floating Rate Notes”), $400,000,000 aggregate principal amount of 2.100% Notes due 2014 (the “2014 Fixed Rate Notes”), $400,000,000 aggregate principal amount of 3.100% Notes due 2016 (the “2016 Notes”), and $400,000,000 aggregate principal amount of 4.625% Notes due 2021 (the “2021 Notes”).
I would like to tag ABC corp as the organization, "aggregate principal amount" as the keyword, and $400,000,000 as the number value.
I tried running some samples through http://corenlp.run/ and it works great for the amounts, the keywords, and dates; however, the organization name is not always tagged. Is this the standard use case for NER? Any idea why that might be the case for the organization name?
Yes, the NER model should tag organizations in text. Note that the model was trained on sentences that differ from your data, so performance will drop. Also, the model does not have 100% recall, so it will make mistakes from time to time.
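If you want to run the same NER pipeline locally instead of through corenlp.run, a minimal sketch looks like this (assuming a recent CoreNLP release and its models jar on the classpath; the classes below are from the CoreNLP 3.9+ API):
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreEntityMention;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class NerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        CoreDocument doc = new CoreDocument(
            "ABC corp submitted the following on 1/1/2017 ...");
        pipeline.annotate(doc);
        // Print each detected entity mention and its type (ORGANIZATION, MONEY, DATE, ...)
        for (CoreEntityMention em : doc.entityMentions()) {
            System.out.println(em.text() + " -> " + em.entityType());
        }
    }
}
This also lets you inspect exactly which spans the model sees, which helps when diagnosing why an organization name is missed.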

ontology-entity-matching from multi-keyword-query / subgraph extraction

Example
I have an ontology with many entities and relations. Let's look at the following example:
The user types "brad pitt angelina jolie", and the system should output the corresponding subgraph containing the entities Brad Pitt and Angelina Jolie.
Question
How do I find out which of the tokens "brad", "pitt", "angelina", "jolie" should be taken together to match to the entities Brad Pitt and Angelina Jolie?
In this example the solution would be: take "brad" and "pitt" together as "brad pitt" to match the entity Brad Pitt, and the third and fourth tokens together as "angelina jolie" to match the entity Angelina Jolie.
Difficulty
I don't know how many entities are contained in the query string, and entities may consist of many tokens, like Karl Theodor zu Guttenberg.
Guess
I guess that there's probably a lot of research done in this field -- but I'm missing the entry point or the technical terms.
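One naive baseline for the token-grouping part described above, assuming you can look candidate spans up in a dictionary of known entity labels (the dictionary and names here are hypothetical), is a greedy longest-match scan over the tokens:
import java.util.*;

public class LongestMatch {
    // Greedily take the longest span starting at each position that matches a known entity name.
    public static List<String> groupTokens(String[] tokens, Set<String> entityNames) {
        List<String> groups = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int bestEnd = i + 1; // fall back to the single token
            for (int j = tokens.length; j > i; j--) {
                String span = String.join(" ", Arrays.copyOfRange(tokens, i, j));
                if (entityNames.contains(span)) { bestEnd = j; break; }
            }
            groups.add(String.join(" ", Arrays.copyOfRange(tokens, i, bestEnd)));
            i = bestEnd;
        }
        return groups;
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<>(Arrays.asList("brad pitt", "angelina jolie"));
        System.out.println(groupTokens("brad pitt angelina jolie".split(" "), names));
        // -> [brad pitt, angelina jolie]
    }
}
This is only a sketch; greedy matching can mis-segment ambiguous queries, which is where the research literature on entity linking picks up.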

How to quickly search book titles?

I have a database of about 200k books. I wish to give my users a way to quickly search for a book by title. Some titles might have a prefix like A, THE, etc., and can also have numbers in the title, so a search for 12 should match books with "12", "twelve", and "dozen" in the title. This will work via AJAX, so I need to make sure the database query is really fast.
I assume that most users will try to search using some words of the title, so I'm thinking of splitting all the titles into words and creating a separate database table which maps words to titles. However, I fear this might not give the best results. For example, the book title could be 2 or 3 commonly used words, and I might get a list of books with longer titles that contain all 2-3 words, with the one I'm looking for lost like a needle in a haystack. Also, searching for a book with many words in the title might slow down the query because of a lot of OR clauses.
Basically, I'm looking for a way to:
find the results quickly
sort them by relevance.
I assume this is not the first time someone needs something like this, and I'd hate to reinvent the wheel.
P.S. I'm currently using MySQL, but I could switch to anything else if needed.
Using SOUNDEX is the best way, I think.
SELECT
    id,
    title
FROM products AS p
WHERE p.title SOUNDS LIKE 'Shaw'
-- This will match 'Saw', etc.
For the best database performance, you can precompute the SOUNDEX value of your titles and store it in a new column. You can calculate the soundex of a string with SOUNDEX('Hello').
Example usage:
UPDATE `books` SET `soundex_title` = SOUNDEX(title);
SELECT id, title FROM books WHERE soundex_title = SOUNDEX('Shaw');
You might want to have a look at Apache Lucene. This is a high-performance, Java-based information retrieval system.
You would want to create an IndexWriter and index all your titles, and you can add fields (have a look at the class) linking to the actual book.
When searching, you would need an IndexReader and an IndexSearcher, and use the search() operation on them.
Have a look at the sample in src/demo and at: http://lucene.apache.org/java/2_4_0/demo2.html
Using information retrieval techniques makes the indexing take longer, but every search will not require going through most of the titles, and overall you can expect better search performance.
Also, choosing a good Analyzer enables you to ignore words such as "the" and "a".
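As a rough sketch of that flow (class names here are from recent Lucene releases, so they differ in detail from the 2.4 demo linked above; assumes lucene-core and lucene-queryparser on the classpath):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TitleSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // use FSDirectory for 200k titles
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index each title with a stored book id.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("title", "A Dime in a Dozen", Field.Store.YES));
        doc.add(new StringField("book_id", "356", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search; hits come back ranked by relevance.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query q = new QueryParser("title", analyzer).parse("dozen");
        for (ScoreDoc sd : searcher.search(q, 10).scoreDocs) {
            Document hit = searcher.doc(sd.doc);
            System.out.println(hit.get("book_id") + ": " + hit.get("title"));
        }
    }
}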
One solution that would easily accommodate your volume of data and speed requirement is to use the Redis key-value store.
The way I see it, you can go ahead with your solution of mapping titles to keywords and storing them in the form:
keyword : set of book titles
Redis already has a built-in set data type that you can use.
Next, to get the titles of the books that contain the search keywords, you can use the SINTER command, which will perform the set intersection for you.
Everything is done in memory, so the response time is very fast.
Also, if you want to save your index, Redis has a number of different persistence/caching mechanisms.
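A quick sketch of that keyword-to-titles scheme using the Jedis client (the client choice, key names, and sample titles here are just illustrative):
import java.util.Set;
import redis.clients.jedis.Jedis;

public class TitleIndex {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);
        // Map each keyword to the set of titles containing it.
        jedis.sadd("kw:dime", "A Dime in a Dozen");
        jedis.sadd("kw:dozen", "A Dime in a Dozen", "Cheaper by the Dozen");
        // SINTER returns only titles containing all the search keywords.
        Set<String> hits = jedis.sinter("kw:dime", "kw:dozen");
        System.out.println(hits); // [A Dime in a Dozen]
        jedis.close();
    }
}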
Apache Lucene with Solr is definitely a very good option for your problem.
You can link Solr/Lucene directly to index your MySQL database. Here is a simple tutorial on how to link your MySQL database with Lucene/Solr: http://www.cabotsolutions.com/2009/05/using-solr-lucene-for-full-text-search-with-mysql-db/
Here are the advantages and pains of using Lucene-Solr instead of MySQL full-text search: http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html
Keep it simple. Create an index on the title field and use wildcard pattern matching. You cannot possibly make it any faster, as your bottleneck is not the string matching but the number of strings you want to match against the title.
And I just came up with a different idea. You say that some words can be interpreted differently, like 12, twelve, dozen. Instead of creating a query with different interpretations, why not store the different interpretations of the titles in a separate table with a one-to-many relation to the books? You can then GROUP BY book_id to get unique book titles.
Say the book is "A dime in a dozen". In the books table it will be:
book_id=356
book_title='A dime in a dozen'
In the titles table will be stored:
titles_id=123
titles_book_id=356
titles_title='A dime in a dozen'
--
titles_id=124
titles_book_id=356
titles_title='A dime in a 12'
--
titles_id=125
titles_book_id=356
titles_title='A dime in a twelve'
The query for this:
SELECT b.book_id, b.book_title
FROM books b JOIN titles t ON b.book_id = t.titles_book_id
WHERE t.titles_title LIKE '%twelve%'
GROUP BY b.book_id
Now insertion becomes a much bigger task, but creating the variants can be done outside the database and inserted in one swoop.
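Generating those variants outside the database could look like this sketch (the equivalence classes are hypothetical and would need to be curated for your catalog):
import java.util.*;
import java.util.regex.Pattern;

public class TitleVariants {
    // Hypothetical equivalence classes; extend with whatever synonyms you need.
    static final List<List<String>> EQUIV = List.of(List.of("12", "twelve", "dozen"));

    public static Set<String> variants(String title) {
        Set<String> out = new LinkedHashSet<>();
        out.add(title);
        for (List<String> cls : EQUIV) {
            for (String w : cls) {
                Pattern p = Pattern.compile("(?i)\\b" + Pattern.quote(w) + "\\b");
                // For every variant containing w, add a version with each synonym swapped in.
                for (String v : new ArrayList<>(out)) {
                    if (p.matcher(v).find()) {
                        for (String syn : cls) {
                            out.add(p.matcher(v).replaceAll(syn));
                        }
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the rows to insert into the titles table for one book.
        System.out.println(variants("A dime in a dozen"));
        // [A dime in a dozen, A dime in a 12, A dime in a twelve]
    }
}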

Sort on last name, first name, or both?

I have a dilemma that I've encountered before. What's the best in terms of usability when one displays personal names in a table? Should there be a single column for the name? If so, is "firstname lastname" or "lastname, firstname" preferable? Or would a column for "firstname" and a column for "lastname" be best? I'm thinking in terms of the user's desire to sort the columns. I like having a column for each name component because I can imagine that in some cases the first name will be more important to the user whereas in other cases the last name would be more important.
I would assume that many out there have had this dilemma, and I am looking for pearls of wisdom based on past experience.
Definitely have a column for each part. That gives you much more flexibility. So you could sort by surname, but print "firstname surname", for example.
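In code terms, that separation is trivial once the parts are stored separately; a small sketch (the Person record is hypothetical, Java 16+):
import java.util.*;

record Person(String first, String last) {
    String display() { return first + " " + last; }
}

class SortDemo {
    public static void main(String[] args) {
        List<Person> people = new ArrayList<>(List.of(
            new Person("Arnold Q.", "Zonkenstein"),
            new Person("Wally P.", "Bimbleman")));
        // Sort by surname, then first name, but print "firstname surname".
        people.sort(Comparator.comparing(Person::last).thenComparing(Person::first));
        people.forEach(p -> System.out.println(p.display()));
    }
}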
If you don't have the screen real estate to have a column for each part, you can combine them into a single string whose format represents the sorting order. Each click on the column header cycles to the next sort order. For example:
Default: sort by last, first (ASC)
Bimbleman, Wally P.
Zonkenstein, Arnold Q.
1st click: sort by last, first (DESC)
Zonkenstein, Arnold Q.
Bimbleman, Wally P.
2nd click: sort by first, middle, last (ASC)
Arnold Q. Zonkenstein
Wally P. Bimbleman
3rd click: sort by first, middle, last (DESC)
Wally P. Bimbleman
Arnold Q. Zonkenstein
etc...
Easier to read an entire name this way (vs. having it span across columns), takes up less screen real estate, and frees you from having to decide upon a single format & sort.
As far as I know, each country has its own rules for sorting names; some countries sort by first name and some by last name. I believe the right questions here are: what is your app about? How many users will appear in those columns? And which users (age/nationality/context) are going to use your app?
Really, I agree with Skilldrick: a good UI has at least separate columns for first and last names.
But don't forget that consistency in a UI is actually more important and makes things usable, giving the end user an implied expectation of how things are done.
You might consider calling the fields "Given Name" and "Family Name" to account for people who put their family name first. Of course this doesn't cover everyone (some people only have a given name) but it might reduce potential confusion with Chinese and Japanese names, for example.
In most cases you will find that these fields cover most scenarios: Title, Firstname, Middlename, Lastname.
In most systems that I have worked with here in Australia, data is sorted by last name in the default display. Also, if you are providing search on the screen, usually the last name field comes before the first name. Sorting by first name is just as common, though, so your system should always allow the view to switch to sorting by first name.
Here is a solution for a single column; I don't think separate columns can be scanned and read as quickly, although I don't have any data to back that up.
The primary focus of a user-oriented solution should be to display names as they would be read aloud, i.e. Title Firstname Middlename Lastname.
For most domains where the names are known to the user, sorting by first name is acceptable. Here is an example where a person's title is ignored in the sorting, and the sort field is clear because it is highlighted:
Arnold Q. Zonkenstein
Mr. David Cliff
Marty P. Bimbleman
For formal business oriented applications, the default sorting could be by surname. You can preserve reading order, while still sorting by last name, again using highlighting:
Marty P. Bimbleman
Mr. David Cliff
Arnold Q. Zonkenstein
If you want the sort field to be configurable, use an explicit checkbox; clicking multiple times on the column heading to cycle between sort fields will be jarring to the user (toggling the sort direction by clicking on the heading is more acceptable).
IMO this is the simplest solution without any compromises.

How do I go about building a matching algorithm?

I've never built an algorithm for matching before and don't really know where to start. So here is my basic set up and why I'm doing it. Feel free to correct me if I'm not asking the right questions.
I have a database of names and unique identifiers for people. Several generated identifiers (internally generated and some third party), last name, first name, and birth date are the primary ones that I would be using.
Several times throughout the year I receive a list from a third party that needs to be imported and tied to the existing people in my database but the data is never as clean as mine. IDs could change, birth dates could have typos, names could have typos, last names could change, etc.
Each import could have 20,000 records, so even if it's 99% accurate, that's still 200 records I'd have to go in and match manually. I think I'm looking for more like 99.9% accuracy when it comes to matching the incoming people to my users.
So, how do I go about making an algorithm that can figure this out?
P.S. Even if you don't have an exact answer, pointers to some reference materials would also be helpful.
P.P.S. Some examples would be similar to what m3rLinEz wrote:
ID: 9876234 Fname: Jose LName: Guitierrez Birthdate:01/20/84 '- Original'
ID: 9876234 Fname: Jose LName: Guitierrez Birthdate:10/20/84 '- Typo in birth date'
ID: 0876234 Fname: Jose LName: Guitierrez Birthdate:01/20/84 '- Wrong ID'
ID: 9876234 Fname: Jose LName: Guitierrez-Brown Birthdate:01/20/84 '- Hyphenated last name'
ID: 9876234 Fname: Jose, A. LName: Guitierrez Birthdate:01/20/84 '- Added middle initial'
ID: 3453555 Fname: Joseph LName: Guitierrez Birthdate:01/20/84 '- Probably someone else with same birthdate and same last name'
You might be interested in Levenshtein distance.
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
It is possible to compare each of your fields and compute the total distance, and by trial and error you may discover the appropriate threshold at which records can be interpreted as matching. I have not implemented this myself, but just thought of the idea :}
For example:
Record A - ID: 4831213321, Name: Jane
Record B - ID: 431213321, Name: Jann
Record C - ID: 4831211021, Name: John
The distance between A and B will be lower than between A and C or B and C, which indicates a better match.
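A straightforward dynamic-programming implementation of the distance, as a sketch:
public class Levenshtein {
    // Classic DP: dp[i][j] = edits to turn the first i chars of a into the first j chars of b.
    public static int distance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i; // deletions
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j; // insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1; // substitution
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                                    dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("4831213321", "431213321")); // 1 (one deleted digit)
        System.out.println(distance("Jane", "Jann"));            // 1 (one substitution)
        System.out.println(distance("Jane", "John"));            // 3
    }
}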
When it comes to something like this, do not reinvent the wheel. The Levenshtein distance is probably your best bet if you HAVE to do this yourself, but otherwise, do some research on existing solutions that do database queries and fuzzy searches. They've been doing it longer than you, and they'll probably be better, too.
Good luck!
If you're dealing with data sets of this size and different resources being imported, you may want to look into an Identity Management solution. I'm mostly familiar with Sun Identity Manager, but it may be overkill for what you're trying to do. It might be worth looking into.
If the data you are getting from third parties is consistent (same format each time), I'd probably create a table for each of the third parties you are getting data from. Then import each new set of data into the same table each time. I know there's a way to join the two tables based on common columns in each using an SQL statement. That way you can perform SQL queries and get data from multiple tables, but make it look like it came from one single unified table. Similarly, records that were added and don't have matches in both tables can be found and then manually paired. This way you keep your 'clean' data separate from the junk you get from third parties. If you wanted a true import, you could then use that joined table to create a third table containing all your data.
I would start with the easy, near-100%-certain matches and handle them first, so now you have a list of, say, 200 that need fixing.
For the remaining rows you can use a simplified version of Bayes' theorem.
For each unmatched row, calculate the likelihood that it is a match for each row in your data set, assuming that the data contains certain changes which occur with certain probabilities. For example, a person changes their surname with probability 0.1% (possibly also depending on gender), changes their first name with probability 0.01%, and has a single typo with probability 0.2% (use the Levenshtein distance to count the number of typos). Other fields also change with certain probabilities. For each row, calculate the likelihood that the row matches, considering all the fields that have changed. Then pick the one that has the highest probability of being a match.
For example, a row with only a small typo in one field but equal on all others would have a 0.2% chance of a match, whereas rows which differ in many fields might have only a 0.0000001% chance. So you pick the row with the small typo.
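As a sketch of that scoring idea (the probabilities are the illustrative ones from above, and the typo count would come from a Levenshtein distance function such as the one sketched in the earlier answer):
public class MatchScorer {
    // Assumed per-change probabilities from the answer above.
    static final double P_SURNAME_CHANGE = 0.001;    // 0.1%
    static final double P_FIRSTNAME_CHANGE = 0.0001; // 0.01%
    static final double P_TYPO = 0.002;              // 0.2% per single-character edit

    // Likelihood that a candidate row matches, given which changes were observed,
    // modeled as the product of the probabilities of those changes.
    static double likelihood(boolean surnameChanged, boolean firstNameChanged, int typos) {
        double p = 1.0;
        if (surnameChanged) p *= P_SURNAME_CHANGE;
        if (firstNameChanged) p *= P_FIRSTNAME_CHANGE;
        p *= Math.pow(P_TYPO, typos); // each edit counted as an independent typo
        return p;
    }

    public static void main(String[] args) {
        // One small typo in the birth date vs. a changed first name plus two typos:
        System.out.println(likelihood(false, false, 1)); // 0.002 -> best candidate
        System.out.println(likelihood(false, true, 2));  // 4.0E-10
    }
}
You would compute this likelihood against every candidate row and keep the argmax, flagging the record for manual review when even the best score falls below some threshold.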
Regular expressions are what you need. Why reinvent the wheel?
