It seems easy to parallelize parsers when the input is already given in a split format, e.g. a large list of individual database entries, or can be split by a fast preprocessing step, e.g. parsing the grammatical structure of individual sentences in a large text.
Parallel parsing seems harder when just locating the sub-structures in the input already takes some effort. Code in common programming languages looks like a good example. In languages like Haskell, which use layout/indentation to separate individual definitions, you could probably check the number of leading spaces of each line once you've found the start of a new definition, skip lines until the next definition starts, and hand each skipped chunk to another thread for full parsing (as sketched below).
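To make that concrete, here is a rough Python sketch of the idea. It assumes a new top-level definition starts at a non-empty, non-indented line and ignores comments, pragmas and CPP; the function names are made up for illustration and parse_definition is just a placeholder.

from concurrent.futures import ProcessPoolExecutor

def split_top_level(source):
    # Crude split: a new top-level definition starts at a non-empty line with no
    # leading whitespace (comments, pragmas and CPP are ignored for simplicity).
    chunks, current = [], []
    for line in source.splitlines(keepends=True):
        if current and line.strip() and not line[0].isspace():
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks

def parse_definition(chunk):
    return chunk.splitlines()[0]   # placeholder for the real (sequential) parser

def parse_parallel(source):
    with ProcessPoolExecutor() as pool:
        return list(pool.map(parse_definition, split_top_level(source)))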
When it comes to languages like C or JavaScript, which use balanced braces to delimit scopes, the preprocessing would take much more work. You'd need to scan the whole input, counting braces while taking care of text inside string literals and so on. It gets even worse with languages like XML, where you also need to keep track of the tag names in opening/closing tags.
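For the C/JavaScript case, such a pre-pass could look roughly like the sketch below (Python for brevity). It tracks string/character literals and the two C-style comment forms, which is most of the bookkeeping mentioned above, and records offsets where the brace depth returns to zero, so the chunks between those offsets can be handed to worker threads. It is a sketch, not a production scanner.

def top_level_split_points(src):
    # Find offsets where brace depth returns to zero, skipping string/char
    # literals and the two comment styles of C-like languages.
    points, depth, i, n = [], 0, 0, len(src)
    state = "code"   # or "str", "chr", "line_comment", "block_comment"
    while i < n:
        c = src[i]
        if state == "code":
            if c == '"': state = "str"
            elif c == "'": state = "chr"
            elif c == "/" and src[i:i+2] == "//": state = "line_comment"
            elif c == "/" and src[i:i+2] == "/*": state = "block_comment"; i += 1
            elif c == "{": depth += 1
            elif c == "}":
                depth -= 1
                if depth == 0:
                    points.append(i + 1)   # a safe place to cut a chunk
        elif state in ("str", "chr"):
            if c == "\\": i += 1           # skip the escaped character
            elif c == ('"' if state == "str" else "'"): state = "code"
        elif state == "line_comment":
            if c == "\n": state = "code"
        else:                              # block_comment
            if src[i:i+2] == "*/": state = "code"; i += 1
        i += 1
    return points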
I found a parallel version of the CYK parsing algorithm that seems to work for all context-free grammars. But I'm curious what other general concepts/algorithms exist for parallelizing parsers, including tricks like the brace counting described above that only work for a limited set of languages. This question is not about specific implementations but about the ideas such implementations are based on.
I think you will find McKeeman's 1982 paper on Parallel LR Parsing quite interesting, as it appears to be practical and applies to a broad class of grammars.
The basic scheme is standard LR parsing. What is clever is that the (presumably long) input is divided into roughly N equal-sized chunks (for N processors), and each chunk is parsed separately. Because the starting point of a chunk may (must!) fall in the middle of some production, McKeeman's individual parsers, unlike classic LR parsers, start with all possible left contexts (which requires augmenting the LR state machine) to determine which LR items apply to the chunk. (It shouldn't take very many tokens before an individual parser has determined which states really apply, so this isn't very inefficient.) Then the results of all the parsers are stitched together.
He sort of ducks the problem of partitioning the input in the middle of a token. (You can imagine an arbitrarily long string literal containing text that looks like code, fooling the parser that starts in the middle of it.) What appears to happen is that such a parser runs into an error and abandons its parse; the parser to its left takes up the slack. One can imagine the chunk splitter using a little bit of smarts to mostly avoid this.
He goes on to demonstrate a real parser in which speedups are obtained.
Clever, indeed.
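To get a feel for the "start in every possible left context, then stitch the chunks together" idea on a toy scale, here is a hedged Python sketch. It is not McKeeman's LR construction; it just runs a tiny hand-written DFA (tracking whether we are inside a string literal) from every possible start state per chunk and composes the per-chunk results left to right.

CODE, STR, ESC = 0, 1, 2   # toy lexer states: outside a string, inside, after a backslash

def step(state, ch):
    if state == CODE:
        return STR if ch == '"' else CODE
    if state == STR:
        if ch == '\\':
            return ESC
        return CODE if ch == '"' else STR
    return STR                 # ESC: the escaped character is consumed, still inside the string

def chunk_summary(chunk):
    # For every possible left context (start state), compute the resulting end state.
    out = {}
    for start in (CODE, STR, ESC):
        s = start
        for ch in chunk:
            s = step(s, ch)
        out[start] = s
    return out

def compose(left, right):
    # Stitch two adjacent chunk summaries together.
    return {start: right[left[start]] for start in left}

text = 'x = "a } b"; { y'
chunks = [text[:6], text[6:]]                        # arbitrary split, possibly mid-string
summaries = [chunk_summary(c) for c in chunks]       # the embarrassingly parallel step
total = summaries[0]
for s in summaries[1:]:
    total = compose(total, s)
print(total[CODE])   # end state, assuming the whole input started outside a string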
I have a very large file (1.5M rows) containing a JSON dictionary on each row.
Each row contains a parsed Wikipedia article.
For example
{"title": "article title", "summary": "this is a summary of around 500 words", "article": "This is the whole article with more than 3K words"}
{"title": "article2 title", "summary2": "this is another summary of around 500 words", "article": "This is another whole article with more than 3K words"}
Note that the file as a whole is not itself valid JSON.
I want to compute some statistics on these texts, e.g. mean number of sentences, mean number of words, compression ratio, etc. However, everything I try takes ages.
What is the fastest way to go about this? For reference, at the moment I am using spaCy for word and sentence tokenization, but I am open to more approximate solutions, e.g. using regex, if they are the only way.
If you want high performance, you should probably process the lines in parallel using multiple threads, and extract the target metric from each line using SIMD-friendly code. It is also probably a good idea to simplify the parsing by using specialized code that handles only this problem rather than a general parsing tool (like regular expressions, unless the regex engine can produce very fast, linear-time, JIT-compiled code).
The multithreading part is certainly the easiest, since the computation appears to be mostly embarrassingly parallel: each thread computes the target metrics on a chunk of lines and then takes part in a parallel reduction (i.e. a sum) of those metrics.
Each line can be parsed relatively quickly using simdjson. Since the JSON documents are small and the structure appears to be simple and always the same, you can use a regular expression like "article" *: *"((?:[^"\\]+|\\.)*)" to find the article field (note that you may need to escape the backslashes differently depending on the language used). However, the best strategy is probably to parse the JSON document yourself so you can extract the wanted string much more efficiently, for example by searching for a very specific key pattern/string like " (with a SIMD-friendly loop) followed by article, and then parsing the rest with a more robust (but slower) method.
Similar strategies apply to counting words. A fast over-approximation is to count the number of spaces directly followed by a non-space character. The string encoding matters for parsing speed, since decoding UTF-8 strings is generally pretty slow. One fast solution is to just discard non-ASCII characters if the target language is English or mostly uses ASCII characters. If that is not the case, you can use a SIMD-aware UTF-8 decoding library (or a hand-written algorithm in the worst case). Working on small chunks of about 1 KB can help use the CPU cache more efficiently (and help auto-vectorization if you use a compiled native language like C or C++).
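For what it's worth, here is a rough, plain-Python sketch of that pipeline (no SIMD, and multiprocessing instead of threads because of the GIL). The field name "article" comes from the example in the question, and "articles.jsonl" is a hypothetical file name.

import multiprocessing as mp
import re

ARTICLE_RE = re.compile(r'"article"\s*:\s*"((?:[^"\\]|\\.)*)"')   # field name taken from the question

def line_stats(line):
    m = ARTICLE_RE.search(line)
    text = m.group(1) if m else ""
    # Over-approximate word count: spaces directly followed by a non-space character.
    words = sum(1 for a, b in zip(text, text[1:]) if a == " " and b != " ")
    sentences = text.count(".") + text.count("!") + text.count("?")
    return words, sentences

def file_stats(path):
    with open(path, encoding="utf-8", errors="ignore") as f, mp.Pool() as pool:
        word_sum = sent_sum = n = 0
        for w, s in pool.imap_unordered(line_stats, f, chunksize=1024):
            word_sum += w
            sent_sum += s
            n += 1
    return (word_sum / n, sent_sum / n) if n else (0.0, 0.0)   # mean words/sentences per article

if __name__ == "__main__":
    print(file_stats("articles.jsonl"))   # hypothetical file name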
If you are not very familiar with SIMD instructions or low-level parsing strategies/algorithms (like this one), note that there are fast parsing libraries, such as Hyperscan, that do basic operations efficiently.
I have a task ahead of me which relies on interpreting structure of a text – to be precise, a monolingual dictionary. The dictionary has quite complex entries: up to 29 unique elements, and some are nested within others. I am designing my own XML schema for the dictionary, but I would like to write a program that parses the plain text I have automatically.
I have some basic skills in Ruby and I am a rather experienced RegEx user, but I think creating lots of if-trees and extremely long RegEx formulas is probably not the best idea. I have found some information on Parsing Expression Grammars, Backus Normal Form and W-grammars, but it is somewhat unclear to me which of them applies best to this kind of task.
My question is: what is the best way to interpret the structure of a text written in a natural language? I don't want to interpret the language itself, but rather to divide each entry into segments based on the characters and keywords used, as well as their neighborhood. What gems and resources would you suggest?
Edit: here's an example of a moderately simple entry from the dictionary (in Polish). What I want to do is to tag each element (senses, explanations, collocations, label markers, etc.). As you can see, I am looking for an efficient way to encompass a large number of cases in a tree-like form.
Another problem is that I want to have lots of captures, as I want to tag the segments in XML from bigger to smaller.
This looks like a problem that would be well suited to Treetop. I don't think I have enough information to be sure that it will work, but being able to combine regular expressions into a larger structure, where each of the 29 elements can be managed and its information extracted/represented using any of Ruby's features as appropriate, seems like the sort of feature set you need.
Are there any papers on state-of-the-art UTF-8 validators/decoders? I've seen implementations "in the wild" that use clever loops processing up to 8 bytes per iteration in common cases (e.g. all-ASCII input).
I don't know about papers; it's probably a bit too specific and narrow a subject for strictly scientific analysis, and rather an engineering problem. You can start by looking at how this is handled in different libraries. Some solutions use language-specific tricks while others are very general. For Java, you can start with the code of UTF8ByteBufferReader, a part of Javolution. I have found this to be much faster than the character-set converters built into the language. I believe (but I'm not sure) that the latter use a common piece of code for many encodings plus encoding-specific data files. Javolution, in contrast, has code designed specifically for UTF-8.
There are also techniques for specific tasks. For example, if you only need to calculate how many bytes a UTF-8 character takes as you parse the text, you can use a table of 256 values indexed by the first byte of the UTF-8 encoded character (sketched below); this way of skipping over characters, or calculating a string's length in characters, is much faster than using bit operations and conditionals.
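A minimal sketch of such a length table, written in Python for brevity (the same 256-entry table works just as well in Java or C):

UTF8_LEN = [0] * 256                           # 0 marks continuation/invalid lead bytes
for b in range(0x00, 0x80): UTF8_LEN[b] = 1    # 0xxxxxxx: ASCII
for b in range(0xC2, 0xE0): UTF8_LEN[b] = 2    # 110xxxxx
for b in range(0xE0, 0xF0): UTF8_LEN[b] = 3    # 1110xxxx
for b in range(0xF0, 0xF5): UTF8_LEN[b] = 4    # 11110xxx

def char_count(data):
    # Count characters by hopping from lead byte to lead byte via the table.
    i = n = 0
    while i < len(data):
        step = UTF8_LEN[data[i]]
        i += step if step else 1          # treat invalid/stray bytes as length 1
        n += 1
    return n

print(char_count("naïve héllo 中文".encode("utf-8")))   # -> 14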
For some situations, e.g. if you can spare some memory and you know that most characters you encounter will be from the Basic Multilingual Plane, you could try even more aggressive lookup tables: for example, first calculate the length in bytes by the method described above and, if it's 1 or 2 bytes (maybe 3 makes sense too), look up the decoded character in a table. Remember, however, to benchmark this and any other algorithm you try, as it need not be faster at all (bit operations are quite fast, and with a big lookup table you lose locality of reference, plus the offset calculation isn't completely free either).
Anyway, I suggest you start by looking at the Javolution code or another similar library.
How would you go about calculating/finding the number of operations a regex takes to match against a given string? I'd like to develop a program that would allow you to rank regexes in order of efficiency.
Also, is it possible to break out of a regex if the number of operations exceeds a given threshold? I'm hoping to turn this into a web app, so I don't want users entering regexes that could potentially kill the server (if that's even possible).
Many thanks.
Edit: Just to clarify, I'm referring to the superset of plain regexes that includes backtracking (which is therefore non-linear).
The way to find out how many operations it will take to parse a given string is to parse it and count the number of operations. You could do somewhat limited static analysis, but a definitive answer would be tantamount to solving the halting problem.
Trying to rank expressions for any input is even more complex. Take the expression A[0-9]+
The string "A999" will match, and take roughly O(n) time.
The string "B943" will immediately fail, taking O(1) time.
A regular expression parser is fundamentally just a program. It is almost always not possible to say one program is faster than another in general, only for specific input.
You could try static analysis based on some understanding of what the input might be. For example, an expression which can immediately eliminate a large portion of the common inputs might be faster than one which doesn't. But I would say the only way to do this is to also accept a dataset of sample inputs with a distribution similar to what will actually be matched, and either run benchmarks [easy] or do analysis [hard] using that data.
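If the main goal is protecting a server from pathological patterns, one pragmatic option is to use a wall-clock budget as a stand-in for an operation count and run each match in a process you can kill. A hedged Python sketch (the timeout value and the test patterns are arbitrary):

import multiprocessing as mp
import re
import time

def _match(pattern, text, q):
    t0 = time.perf_counter()
    m = re.search(pattern, text)
    q.put((m is not None, time.perf_counter() - t0))

def timed_match(pattern, text, timeout=1.0):
    # Run the match in a separate process so a pathological (backtracking)
    # pattern can be killed once it exceeds the wall-clock budget.
    q = mp.Queue()
    p = mp.Process(target=_match, args=(pattern, text, q))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()
        p.join()
        return None                      # budget exceeded
    return q.get()                       # (matched?, seconds taken)

if __name__ == "__main__":
    print(timed_match(r"A[0-9]+", "A999"))            # fast, roughly O(n)
    print(timed_match(r"(a+)+$", "a" * 30 + "b"))     # catastrophic backtracking, gets cut off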
I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together.
If I just count the words, I get a large number of really common words (is, the, for, in, am, etc.). I have counted the words and the number of other words that come before and after them, but now I really cannot figure out what to do next. The information about the 2- and 3-word phrases is present, but how do I extract it?
Before anything else, try to preserve the information about "boundaries" that comes with the input text (if it has not already been lost; your question implies that the tokenization may already have been done).
During the tokenization (word parsing, in this case) process, look for patterns that may define expression boundaries, such as punctuation (particularly periods) and multiple LF/CR separations, and use these. Words like "the" can also often serve as boundaries. Such expression boundaries are typically "negative", in the sense that they separate two token instances which are sure not to belong to the same expression. A few positive boundaries are quotes, particularly double quotes. This type of info may be useful for filtering out some of the n-grams (see the next paragraph). Word sequences such as "for example", "in lieu of" or "need to" can be used as expression boundaries as well (though using such info is edging on using "priors", which I discuss later).
Without using external data (other than the input text), you can have relative success with this by running statistics on the text's digrams and trigrams (sequences of 2 and 3 consecutive words). Then most of the sequences with a significant (*) number of instances will likely be the type of "expressions/phrases" you are looking for.
This somewhat crude method will yield a few false positives, but on the whole may be workable. Filtering out the n-grams known to cross "boundaries", as hinted in the first paragraph, may help significantly, because in natural languages sentence endings and sentence starts tend to draw from a limited subset of the message space and hence produce combinations of tokens that may appear to be statistically well represented but are typically not semantically related.
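A minimal sketch of that digram/trigram counting, including the crude boundary filtering hinted at above (the boundary set, the toy text and the frequency threshold are arbitrary placeholders):

from collections import Counter

BOUNDARY = {".", ",", ";", ":", "!", "?", "the"}   # crude "negative boundary" tokens

def ngram_counts(tokens, n):
    # Count sequences of n consecutive tokens, skipping any that contain a boundary token.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(g for g in grams if not BOUNDARY & set(g))

tokens = "the black cat sat on the mat . the black cat was tired".split()
for n in (2, 3):
    counts = ngram_counts(tokens, n)
    print([g for g, c in counts.most_common() if c >= 2])   # threshold is arbitrary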
Better methods (possibly more expensive, processing-wise and design/investment-wise) will make use of extra "priors" relevant to the domain and/or the language of the input text.
POS (part-of-speech) tagging is quite useful in several ways: it provides additional, more objective expression boundaries, and also classes of "noise" words (for example, all articles, which, even when used in the context of entities, are typically of little value in the kind of tag clouds the OP wants to produce).
Dictionaries, lexicons and the like can be quite useful too, in particular those which identify "entities" (a.k.a. instances, in WordNet lingo) and their alternative forms. Entities are very important for tag clouds (though they are not the only class of words found in them), and by identifying them it is also possible to normalize them (the many different expressions which can be used for, say, "Senator T. Kennedy"), which eliminates duplicates but also increases the frequency of the underlying entities.
If the corpus is structured as a document collection, it may be useful to use various tricks related to TF (term frequency) and IDF (inverse document frequency).
[Sorry, gotta go for now (plus I would like more detail on your specific goals etc.). I'll try to provide more detail and pointers later.]
[BTW, I want to plug here Jonathan Feinberg's and Dervin Thunk's responses in this post, as they provide excellent pointers in terms of methods and tools for the kind of task at hand. In particular, NLTK and Python at large provide an excellent framework for experimenting.]
I'd start with a wonderful chapter, by Peter Norvig, in the O'Reilly book Beautiful Data. He provides the ngram data you'll need, along with beautiful Python code (which may solve your problems as-is, or with some modification) on his personal web site.
It sounds like you're looking for collocation extraction. Manning and Schütze devote a chapter to the topic, explaining and evaluating the 'proposed formulas' mentioned in the Wikipedia article I linked to.
I can't fit the whole chapter into this response; hopefully some of their links will help. (NSP sounds particularly apposite.) nltk has a collocations module too, not mentioned by Manning and Schütze since their book predates it.
The other responses posted so far deal with statistical language processing and n-grams more generally; collocations are a specific subtopic.
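For completeness, a small sketch of nltk's collocations module in use (the file name "corpus.txt", the frequency filter and the top-50 cutoff are arbitrary; the tokenization here is deliberately naive):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = open("corpus.txt", encoding="utf-8").read().lower().split()   # naive tokenization
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)                                 # ignore very rare pairs
print(finder.nbest(measures.pmi, 50))                       # top candidates by PMI
print(finder.nbest(measures.likelihood_ratio, 50))          # or by log-likelihood ratio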
Build a matrix indexed by words. Then, for every pair of consecutive words, add one to the appropriate cell.
For example, take the sentence "for example you have this sentence":
from collections import defaultdict

mat = defaultdict(lambda: defaultdict(int))  # nested dict: counts of consecutive word pairs
mat['for']['example'] += 1
mat['example']['you'] += 1
mat['you']['have'] += 1
mat['have']['this'] += 1
mat['this']['sentence'] += 1
This gives you counts for every pair of consecutive words. You can do the same for three consecutive words, but beware that a dense matrix would require O(n^3) memory in the number of distinct words.
You can also store the counts in a single hash (dictionary) keyed by the phrase, like:

counts = defaultdict(int)
counts['for example'] += 1
counts['example you'] += 1
One way would be to build yourself an automaton, most likely a nondeterministic finite automaton (NFA).
Another, simpler way would be to create a file containing the words and/or word groups that you want to ignore, find, compare, etc., load them into memory when the program starts, and then compare the file you are parsing against the words/word groups contained in that list.
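A minimal sketch of that approach in Python (the file name "ignore_words.txt" and the sample candidates are made up):

with open("ignore_words.txt", encoding="utf-8") as f:   # hypothetical stop-list file, one entry per line
    ignore = {line.strip().lower() for line in f if line.strip()}

def keep(phrase):
    # Drop any candidate word or word group that appears in the stop list.
    return phrase.lower() not in ignore

candidates = ["for example", "black cat", "the"]        # made-up candidates
print([c for c in candidates if keep(c)])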