Grammar parsing in Ruby

I have a task ahead of me which relies on interpreting the structure of a text – to be precise, a monolingual dictionary. The dictionary has quite complex entries: up to 29 unique elements, and some are nested within others. I am designing my own XML schema for the dictionary, but I would like to write a program that automatically parses the plain text I have.
I have some basic skills in Ruby and I am a rather experienced RegEx user, but I think creating lots of if-trees and extremely long RegEx formulas is probably not the best idea. I have found some information on Parsing Expression Grammars, Backus Normal Form and W-grammars, but it is somewhat unclear to me what each of them applies to best.
My question is: what is the best way to interpret the structure of a text written in a natural language? I don't want to interpret the language itself, but rather to divide each entry into segments based on the characters and keywords used, as well as their neighborhood. What gems and resources would you suggest?
Edit: here's an example of a moderately simple entry from the dictionary (in Polish). What I want to do is to tag each element (senses, explanations, collocations, label markers etc.). As you can see, I am looking for an efficient way to encompass a large number of cases in a tree-like form.
Another problem is that I want to have lots of captures, as I want to tag the segments in XML from the largest to the smallest.

This looks like a problem that would be well suited to Treetop. I don't think I have enough information to be sure that it will work, but being able to combine regular expressions into a larger structure, where each of the 29 elements can be managed and its information extracted and represented using any of Ruby's features as appropriate, seems like the sort of feature set you need.
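For a flavour of what that looks like, here is a minimal sketch; the grammar, its rule names and the sample entry are all invented, and a real grammar would need one rule per element of your schema:

# Minimal sketch only: a made-up grammar with a headword and numbered senses.
require 'treetop'

Treetop.load_from_string(<<'GRAMMAR')
grammar DictEntry
  rule entry
    headword sense+
  end

  rule headword
    [a-zA-Z]+ ' '
  end

  rule sense
    [0-9]+ '.' ' ' (!([0-9]+ '.') .)*
  end
end
GRAMMAR

parser = DictEntryParser.new
tree = parser.parse('kot 1. a small domesticated feline 2. (informal) a novice')
puts tree ? tree.elements.inspect : parser.failure_reason

Each rule becomes a node in the resulting syntax tree, so the "captures" you mention fall out of the grammar structure itself rather than out of regex groups.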

Related

Is it possible to get the meaning of each word using BERT?

I'm a linguist and I'm new to AI. I would like to know if BERT is able to get the meaning of each word based on context.
I've done some searching and found that BERT is able to do that and that, if I'm not wrong, it converts each word into a unique vector, but that's not the output I want.
What I want is to get the meaning, or the components that constitute the meaning, of each word, written in plain English. Is this possible?
No, you cannot get the meaning of a word in plain English. The whole idea of BERT is to convert plain English into meaningful numerical representations.
Unfortunately, these vectors are not interpretable. This is a general limitation of deep learning compared to traditional ML models that work with hand-crafted features.
Note, however, that you can use these representations to find certain relationships between words. For example, words that are close to each other (in terms of some distance measure) have similar meanings. Have a look at this link for more information:
https://opensource.googleblog.com/2013/08/learning-meaning-behind-words.html
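To make "close in terms of some distance measure" concrete, here is a toy Ruby sketch; the three-dimensional vectors are invented for illustration (real BERT embeddings have hundreds of dimensions and come from the model, not from a hand-written hash):

# Toy illustration only: these vectors are made up, not real BERT output.
def norm(v)
  Math.sqrt(v.sum { |x| x * x })
end

def cosine(a, b)
  a.zip(b).sum { |x, y| x * y } / (norm(a) * norm(b))
end

vectors = {
  'king'  => [0.9, 0.7, 0.1],
  'queen' => [0.8, 0.8, 0.1],
  'apple' => [0.1, 0.2, 0.9]
}

puts cosine(vectors['king'], vectors['queen'])  # high: similar "meaning"
puts cosine(vectors['king'], vectors['apple'])  # low: unrelated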

Data Structure to Implement Text Editor?

Recently I was asked this question in an interview. The exact question was
What data structures would you use to implement a text editor? The size of the editor can change, and you also need to store styling information for all of the text (italic, bold, etc.).
At the time, I tried to convince him with many different approaches, such as a stack and a doubly linked list.
Ever since then, this question has been bugging me.
It looks like they'd like to know if you were aware of the flyweight pattern and how to use it correctly.
A text editor is a common example while describing that pattern.
Maybe your interviewer was a lover of the GOF book. :-)
In addition to the previous answers, I would like to add that in order to get to the data structures, you first need to know your design; otherwise the range of options is too broad.
As an example, let's assume that you'll need editing functionality. Here the State and Memento design patterns are a good fit. A very suitable structure is the rope (also called a cord), since it is composed of smaller strings and is used for efficiently storing and manipulating a very long string. In our case, the text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently.
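A minimal sketch of the rope idea (class names invented for illustration): leaves hold short strings, and an internal node caches the length of its left subtree so that character lookup never scans the whole text:

# Sketch only: a rope supporting length and character lookup.
class RopeLeaf
  def initialize(text)
    @text = text
  end

  def length
    @text.length
  end

  def char_at(i)
    @text[i]
  end
end

class RopeNode
  def initialize(left, right)
    @left   = left
    @right  = right
    @weight = left.length   # total length of the left subtree
  end

  def length
    @weight + @right.length
  end

  def char_at(i)
    i < @weight ? @left.char_at(i) : @right.char_at(i - @weight)
  end
end

rope = RopeNode.new(RopeLeaf.new('Hello, '), RopeLeaf.new('world!'))
puts rope.char_at(7)   # => "w"
puts rope.length       # => 13

Insertion and deletion then become tree surgery (splitting and concatenating nodes) instead of copying one huge string.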
An open-ended question like this is designed more to see if you can think cogently about making a design that hangs together well, rather than having one, specific answer.
One specialized answer to the question is to use DOM/XML ("Document Object Model"). Markup "languages" are intended to solve this exact problem. You could store the data for the editor in a DOM. One of the advantages of using a DOM is that there are libraries like Xerces that have extensive support for building and managing DOMs, so a lot of your work is done for you. It is possible the interviewer intended this to be the ideal answer.
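In Ruby, the standard library's REXML can play the role Xerces plays in Java; here is a sketch of holding styled runs of text in a DOM (the element and attribute names are made up for illustration):

# Sketch only: a tiny editor document held as a DOM.
require 'rexml/document'

doc  = REXML::Document.new
root = doc.add_element('document')
para = root.add_element('paragraph', 'indent' => '2')
para.add_element('run', 'bold' => 'true').text = 'Call me Ishmael.'
para.add_element('run', 'italic' => 'true').text = ' Some years ago...'

doc.write($stdout, 2)   # pretty-printed XML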
A more general answer is that any nested sequence structure can be used. The text can be seen as a sequence of strings. Each element of the sequence, like a row in a database, can have multiple attributes (font type, font size, italic, bold, strikethrough, etc.). Nesting (hierarchy) is useful because the document might have structure such as chapters, sections, and paragraphs. For example, if a paragraph has its own styling (say, an indent), then it may need to have its own level. So you have something like this:
Document
  Chapter
    Paragraph
      Text
To implement this, you would use a tree and each node of the tree would have multiple attributes. You would require different kinds of nodes (Chapter nodes, Paragraph nodes, etc). So, for example, a document like a paper would have multiple Section nodes and a Notes node inside a Document node, but a book-like document might have Chapter nodes inside a document node. The advantage of this approach is that it is more specific and hand-tailored to the problem than using a DOM, which is a more flexible approach.
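A sketch of that idea in Ruby (the class names, the style attributes, and the render walk are all invented for illustration; a real editor would define more node kinds, such as Section or Notes):

# Sketch only: a document tree where every node carries style attributes
# and children inherit the styling of their ancestors.
class DocNode
  attr_reader :kind, :children, :style

  def initialize(kind, style: {}, children: [])
    @kind, @style, @children = kind, style, children
  end
end

class TextNode < DocNode
  attr_reader :text

  def initialize(text, style: {})
    super(:text, style: style)
    @text = text
  end
end

doc = DocNode.new(:document, children: [
  DocNode.new(:chapter, children: [
    DocNode.new(:paragraph, style: { indent: 2 }, children: [
      TextNode.new('Call me Ishmael.', style: { bold: true }),
      TextNode.new(' Some years ago...', style: { italic: true })
    ])
  ])
])

# Walk the tree, printing each run of text with its effective (inherited) style.
def render(node, inherited = {})
  style = inherited.merge(node.style)
  if node.is_a?(TextNode)
    puts "#{style.inspect} #{node.text}"
  else
    node.children.each { |child| render(child, style) }
  end
end

render(doc)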
You could also combine the two approaches, using a DOM as your base structure and the hierarchical structure described above as your DOM implementation.
(Note: in the future you should post questions like this to https://softwareengineering.stackexchange.com/)

Find basic words and estimate their difficulty

I'm looking for a possibly simple solution of the following problem:
Given input of a sentence like
"Absence makes the heart grow fonder."
Produce a list of basic words followed by their difficulty/complexity
[["absence", 0.5], ["make", 0.05], ["the", 0.01"], ["grow", 0.1"], ["fond", 0.5]]
Let's assume that:
all the words in the sentence are valid English words
popularity is an acceptable measure of difficulty/complexity
base word can be understood in any constructive way (see below)
difficulty/complexity is on a scale from 0 (piece of cake) to 1 (mind-boggling)
a difficulty bias is OK; it's better to mistakenly call an easy word tough than the other way around
a simple working solution is preferred to flawless but complicated stuff
[edit] there is no interaction with user
[edit] we can handle any proper English input
[edit] a word is not more difficult than its basic form (because, as smart beings, we can create unhappily if we know happy), unless it creates a new word (unlikely is not the same difficulty as like)
General ideas:
I considered using Google searches or sites like Wordcount to estimate word popularity, which could indicate difficulty. However, both solutions give different results depending on the form of the entered word. Google gives 316m results for fond but 11m for fonder, whereas Wordcount gives them ranks of 6k and 54k.
Transforming words to their basic forms is not a must, but it solves the ambiguity problem (and makes it easy to create dictionary links); however, it's not a simple task and its usefulness could be considered arguable. Obviously fond should be taken instead of fonder; however, investigating believe instead of unbelievable seems to be overkill ([edit] it might not be the best example, but there is a point where modifying a basic word creates a new one, like -> likely), and words like doorkeeper shouldn't be cut in two.
Some ideas of what should be considered a basic word can be found here on Wikipedia, but maybe a simpler way of determining it would be to use a dictionary. For instance, according to dictionary.reference.com, unbelievable is a basic word, whereas fonder comes from fond; but then grow is not the same as growing.
Idea of a solution:
It seems to me that the best way to handle the problem would be to use a dictionary to find basic words, apply some of the Wikipedia rules, and then use Wordcount (maybe combined with the number of Google results) to estimate difficulty.
Still, there might be (and probably is) a simpler and better way, or ready-to-use algorithms. I would appreciate any solution that deals with this problem and is easy to put into practice. Maybe I'm just trying to reinvent the wheel (or maybe you know my approach would work just fine and I'm wasting my time deliberating instead of coding what I have). I would, however, prefer to avoid implementing frequency-analysis algorithms or preparing a corpus of texts.
Some terminology:
The core part of the word is called a stem or a root. More on this distinction later. You can think of the root/stem as the part that carries the main meaning of the word and will appear in the dictionary.
(In English) most words are composed of one root (exception: compounds like "windshield") / one stem and zero or more affixes: the affixes that come after the root/stem are called suffixes, and the affixes that precede the root/stem are called prefixes. Examples: "driver" = "drive" (root/stem) + suffix "-er"; "unkind" = "kind" (root/stem) + "un-" (prefix).
Suffixes/prefixes (= affixes) can be inflectional or derivational. For example, in English, third-person singular verbs have an s on the end: "I drive" but "He drive-s". This kind of agreement suffix doesn't change the category of the word: "drive" is a verb regardless of the inflectional "s". On the other hand, a suffix like "-er" is derivational: it takes a verb (e.g. "drive") and turns it into a noun (e.g. "driver").
The stem is the piece of the word without any inflectional affixes, whereas the root is the piece of the word without any derivational affixes. For instance, the plural noun "drivers" is decomposable into "drive" (root) + "er" (derivational affix, makes a new stem "driver") + "s" (plural).
The process of deriving the "base" form of the word is called "stemming".
So, armed with this terminology, it seems that the most useful thing to do for your task would be to stem each form you come across, i.e. remove all the inflectional affixes and keep the derivational ones, since derivational affixes can change how common the word is considered to be. Think about it this way: if I tell you a new word in English, you will always know how to make it plural or third-person singular, but you may not know some of the other words you can derive from it. English being an inflection-poor language, there aren't a lot of inflectional suffixes to worry about (and Google search is pretty good at stripping them off, so maybe you can use Google's stemming engine just by running your word forms through a Google search and picking out the highlighted results):
Third singular verbal -s: "I drive"/"He drive-s"
Nominal plural -s: "One wug"/"Two wug-s". Note that there are some irregular forms here such as "children", "oxen", "geese", etc. I think I wouldn't worry about these.
Verbal past-tense and participial forms. The regular ones are easy: -ed marks both the past tense and the past participle ("I walk"/"I walk-ed"/"I had walk-ed"), but there are quite a few irregular ones (fall/fell/fallen, dive/dove/dived?, etc.). Maybe make a list of these?
Verbal -ing forms: "walk"/"walk-ing"
Adjectival comparative -er and superlative -est. There are a few irregular/suppletive ones ("good"/"better"/"best"), but these should not present a huge problem.
These are the main inflectional affixes in English; I may be forgetting a few that you could discover by picking up an introductory linguistics book. There are also going to be borderline cases, such as "un-", which is so promiscuous that we might consider it inflectional. For more information on these types, see Level 1 vs. Level 2 affixation, but I would treat these cases as derivational for your purposes and not stem them.
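A rough Ruby sketch of that: strip only the regular inflectional endings and look irregular forms up in a small hand-made table. The table and the rules are deliberately incomplete, and a naive suffix rule cannot tell comparative -er ("fonder") from derivational -er ("driver"):

# Sketch only: naive inflection stripping, not a real stemmer.
IRREGULAR = {
  'fell' => 'fall', 'fallen' => 'fall', 'dove' => 'dive',
  'children' => 'child', 'geese' => 'goose',
  'better' => 'good', 'best' => 'good'
}.freeze

def strip_inflection(word)
  w = word.downcase
  return IRREGULAR[w] if IRREGULAR.key?(w)

  case w
  when /(.*[a-z])ies$/     then "#{$1}y"  # "cities"  -> "city"
  when /(.*[a-z]{2,})ing$/ then $1        # "walking" -> "walk"
  when /(.*[a-z]{2,})ed$/  then $1        # "walked"  -> "walk"
  when /(.*[a-z]{2,})est$/ then $1        # "fondest" -> "fond"
  when /(.*[a-z]{2,})er$/  then $1        # "fonder"  -> "fond" (also, wrongly, "driver" -> "driv")
  when /(.*[a-z])s$/       then $1        # "drives"  -> "drive"
  else w
  end
end

%w[fonder makes walking fell unlikely].each do |w|
  puts "#{w} -> #{strip_inflection(w)}"
end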
As far as "grading" how common various stems are, besides google you could various freely-available text corpora. The wikipedia article linked to has a few links to free corpora, and you can find a bunch more by googling. From these corpora you can build a frequency count of each stem, and use that to judge how common the form is.
I'm afraid there is no simple solution to the task of finding "basic" forms. I'm basing that on my memory of my machine learning textbook, of which language analysis was a part. You need some database from which you can get them.
At the same time, please note that the number of words people use in everyday language is not that big. You can always ask a user what the base form of a word you have not seen before is. (Unless this is your homework, which will be automatically checked.)
Eventually, if you don't care about covering all words, you can create a simple database containing different forms of the most common words, and then try to use grammatical rules for the less common ones (which would be a good approximation since, actually, the most common words in English are irregular, whereas the uncommon ones are regular, because their original forms have been forgotten).
Note, however, that I'm no specialist; I'm simply trying to help :-)

Identifying names and places in literature

I have been playing around with Markov chain text generation and naive Bayes classifiers. I am wondering if there is a way to apply either of those concepts towards identifying certain types of words in a novel, e.g. last names or place names.
I can look through my Markov chain and see that certain words tend to relate in the same way to certain other types of words, e.g. "Mr." frequently comes before a last name, "went to" tends to come before a place name, and last names tend to follow first names.
Is there a good way that I can write a program that will take a list of example names and then go through a large set of books and identify all words like those names with decent accuracy? Is English regular enough for this to work? Has this been done before? Would this method have a name?
Thanks,
Andrew
In fact, there are only a few patterns for names, e.g.:
{FirstName}{Space}{Token with big first char}
{BigCharacter}{Dot}{Space}{Token with big first char}
{"Mr" | "Ms"}{Dot}{Space}{Token with big first char}
and several more. All you need is a dictionary of first names and a simple engine to catch such patterns. There's a good framework for this (and many other things): GATE. It has a very large dictionary of first names and a special pattern language (JAPE) for manipulating token sequences. You can use it directly, or just get the dictionary and implement the logic yourself.
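If you'd rather not pull in GATE, the three patterns above translate almost directly into regular expressions. A Ruby sketch, where the tiny first-name list is a stand-in for a real dictionary:

# Sketch only: a toy first-name list standing in for a real dictionary.
FIRST_NAMES = %w[John Mary Andrew].freeze
CAP_TOKEN   = /[A-Z][a-z]+/

PATTERNS = [
  /\b(?:#{FIRST_NAMES.join('|')})\s+(#{CAP_TOKEN})/,  # {FirstName}{Space}{Token with big first char}
  /\b[A-Z]\.\s+(#{CAP_TOKEN})/,                       # {BigCharacter}{Dot}{Space}{Token with big first char}
  /\b(?:Mr|Ms)\.\s+(#{CAP_TOKEN})/                    # {"Mr" | "Ms"}{Dot}{Space}{Token with big first char}
]

text  = 'Mr. Darcy met John Smith and J. Watson near the inn.'
names = PATTERNS.flat_map { |re| text.scan(re).flatten }.uniq
p names   # => ["Smith", "Watson", "Darcy"]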

How do I data mine text?

Here's the problem. I have a bunch of large text files with paragraphs and paragraphs of written matter. Each para contains references to a few people (names), and documents a few topics (places, objects).
How do I data mine this pile to assemble some categorised library? ... In general, I need two things.
I don't know what I'm looking for, so I need a program to get the most used words and multi-word phrases ("Jacob Smith" or "bluewater inn" or "arrow").
Then, knowing the keywords, I need a program to help me search for related paragraphs, then sort and refine the results (manually, by hand).
Your question is a tiny bit open-ended :)
Chances are, you will find modules for whatever analysis you want to do in the UIMA framework:
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA is made of many things
UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
You may also find Open Calais a useful API for text analysis; depending on how big your heap of documents is, it may be more or less appropriate.
If you want it quick and dirty, create an inverted index that stores all locations of words (basically a big map from words to all the file IDs in which they occur, the paragraphs in those files, the lines in the paragraphs, etc.). Also index tuples, so that given a file ID and paragraph you can look up all the neighbors. This will do what you describe, but it takes quite a bit of tweaking to get it to pull up meaningful correlations (some keywords to start you off on your search: information retrieval, TF-IDF, Pearson correlation coefficient).
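A bare-bones Ruby version of such an index; the file names are placeholders, and paragraphs are assumed to be separated by blank lines:

# Sketch only: word -> { file => [paragraph numbers] }.
index = Hash.new { |h, word| h[word] = Hash.new { |f, file| f[file] = [] } }

%w[book1.txt book2.txt].each do |file|               # placeholder file names
  File.read(file).split(/\n\s*\n/).each_with_index do |para, pnum|
    para.downcase.scan(/[a-z]+/).uniq.each { |word| index[word][file] << pnum }
  end
end

# All paragraphs mentioning a keyword:
p index['bluewater']

# Paragraphs where two keywords co-occur:
def co_occurring(index, a, b)
  index[a].flat_map { |file, ps| (ps & index[b][file]).map { |pn| [file, pn] } }
end

p co_occurring(index, 'jacob', 'inn')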
Looks like you're trying to create an index?
I think Learning Perl has information on finding the frequency of words in a text file, so that's not a particularly hard problem.
But do you really want to know that "the" or "a" is the most common word?
If you're looking for some kind of topical index, the words you actually care about are probably down the list a bit, intermixed with more words you don't care about.
You could start by getting rid of "stop words" at the front of the list to filter your results a bit, but nothing would beat associating keywords that actually reflect the topic of the paragraphs, and that requires context.
Anyway, I could be off base, but there you go. ;)
The problem with what you ask is that you don't know what you're looking for. If you had some sort of weighted list of terms that you cared about, then you'd be in good shape.
Semantically, the problem is twofold:
Generally, the most-used words are the least relevant. Even if you use a stop-words file, a lot of chaff remains.
Generally, the least-used words are the most relevant. For example, "bluewater inn" is probably infrequent.
Let's suppose that you had something that did what you ask and produced a clean list of all the keywords that appear in your texts. There would be thousands of such keywords. Finding "bluewater inn" in a list of thousands of terms is actually harder than finding it in a paragraph (assuming you don't know what you're looking for), because you can skim the texts and spot the paragraph that contains "bluewater inn" thanks to its context, but you can't find it in a list, because the list has no context.
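One way to get the "weighted list of terms" mentioned above is TF-IDF, which down-weights ubiquitous words and up-weights rare ones; here is a toy Ruby sketch over three in-memory paragraphs:

# Sketch only: TF-IDF over a toy list of paragraphs (tally needs Ruby 2.7+).
paragraphs = [
  'we stayed at the bluewater inn near the lake',
  'the lake was cold and the inn was warm',
  'jacob smith left the next morning'
]
docs = paragraphs.map { |p| p.downcase.scan(/[a-z]+/) }

# Document frequency: in how many paragraphs does each word appear?
df = Hash.new(0)
docs.each { |words| words.uniq.each { |w| df[w] += 1 } }

def tf_idf(words, df, n_docs)
  tf = words.tally
  tf.to_h { |w, c| [w, (c.to_f / words.size) * Math.log(n_docs.to_f / df[w])] }
end

scores = tf_idf(docs[0], df, docs.size)
# Words unique to the first paragraph ("bluewater", "stayed", ...) tie at the top,
# while "the" scores zero.
p scores.max_by(3) { |_, s| s }.map(&:first)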
Why don't you tell us more about your application and process? Then perhaps we can help you better.
I think what you want to do is called "entity extraction". This Wikipedia article has a good overview and a list of apps, including open source ones. I used to work on one of the commercial tools in the list, but not in a programming capacity, so I can't help you there.
Ned Batchelder gave a great talk at DevDays Boston about Python.
He presented a spell-corrector written in Python that does pretty much exactly what you want.
You can find the slides and source code here:
http://nedbatchelder.com/text/devdays.html
I recommend that you have a look at R. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example of an analysis of the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/) postings from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
More generally, there are a large number of text mining packages in the Natural Language Processing task view on CRAN.
