Multiple Pattern Match Algorithm

I have a lot of logs, and every record contains a URL. I also have about 2000+ URL patterns to filter the logs against, some of which are regular expressions with capturing groups. For each record I want to get the URL, the matched pattern and, if possible, the captured groups. Is there a Java library that can help me, or an algorithm that can solve my problem, or anything else related to it? Thanks a lot.

Take a look at the Java regular expressions library (java.util.regex).
You can construct a single large pattern by concatenating your original patterns with | between them (wrap each one in () so the alternation applies to the whole pattern, not just to the characters adjacent to the |).
The regular expression can be compiled into an efficient matching finite automaton that you can run over your data. Just make sure you compile it once and reuse it for every record.
It will handle extracting groups, but you need to handle the groups in a generic way (since any of them may be the one that matched). If it makes it easier, consider using named groups, as in the sketch below.
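Here is a minimal sketch of that idea (the two sample patterns are illustrative stand-ins for the 2000+ real ones): each original pattern is wrapped in a generated named group, so after a match you can check which group is non-null to recover the winning pattern.

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiPatternMatcher {
    public static void main(String[] args) {
        // Illustrative stand-ins for the real url patterns.
        List<String> patterns = List.of(
                "/users/(\\d+)/profile",
                "/posts/(\\d+)/comments/(\\d+)");

        // Wrap each pattern in a generated named group (p0, p1, ...) so we
        // can tell which alternative matched.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < patterns.size(); i++) {
            if (i > 0) sb.append('|');
            sb.append("(?<p").append(i).append('>').append(patterns.get(i)).append(')');
        }

        // Compile once up front and reuse the Pattern for every log record.
        Pattern combined = Pattern.compile(sb.toString());

        Matcher m = combined.matcher("/posts/42/comments/7");
        if (m.find()) {
            for (int i = 0; i < patterns.size(); i++) {
                if (m.group("p" + i) != null) {
                    System.out.println("matched pattern #" + i + ": " + patterns.get(i));
                }
            }
        }
    }
}

One caveat: the inner capturing groups of all patterns share a single numbering in the combined pattern, so to pull out a specific pattern's own captured groups you need to track the cumulative group count of the patterns before it; and if the original patterns contain named groups of their own, duplicate names across patterns will fail to compile.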

Related

Is there some standard for handling sets of ranges of numbers?

When you open the print dialog of an editor, you typically get a field for specifying which pages you want to print - which can be multiple ranges, e.g.: "5,11,31-33"
Now, there are other scenarios in which this kind of input from a user is relevant - especially in configuration files for sequential or iterative processes where you want to qualify which iterations or elements a certain action or feature should apply to.
However, I'm not aware of a name for this kind of string, nor of an accepted standard format/convention for it (i.e. can you add spaces? Can you use semicolons instead of commas? Must the ranges be sorted? Are overlaps allowed, and are they maintained or discarded? Can ranges use ".." instead of "-"? Can you range down instead of up? etc.).
Is there some such convention or such standard?
My motivation is twofold: I need to parse such ranges in a piece of code I'm looking at, and I want both to do it correctly (or rather, per convention) and to go look for parsing functionality in existing libraries. Right now I don't even have a name to go on.
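For what it's worth, the lenient common dialect is short to parse by hand. A hypothetical sketch (it assumes non-negative integers, commas between terms, "-" for ranges, tolerates spaces, and neither sorts nor deduplicates):

import java.util.ArrayList;
import java.util.List;

public class PageRanges {
    // Parse a spec like "5, 11,31-33" into the expanded page list.
    public static List<Integer> parse(String spec) {
        List<Integer> pages = new ArrayList<>();
        for (String term : spec.split(",")) {
            term = term.trim();
            int dash = term.indexOf('-');
            if (dash < 0) {
                pages.add(Integer.parseInt(term));
            } else {
                // "lo-hi": expand the inclusive range.
                int lo = Integer.parseInt(term.substring(0, dash).trim());
                int hi = Integer.parseInt(term.substring(dash + 1).trim());
                for (int p = lo; p <= hi; p++) pages.add(p);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(parse("5,11,31-33")); // [5, 11, 31, 32, 33]
    }
}

Every policy question above (sorting, overlaps, ".." vs "-", descending ranges) is a choice this sketch simply doesn't make, which is rather the point: absent a standard, you have to document your own answers.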

Most efficient way of matching file names with Ruby regex

The method Dir.glob is used for retrieving file names that match a certain pattern, but its argument has a Unix-like syntax (e.g., using *, ** as wild cards in a particular way, etc.). Instead, I want to use Ruby (Onigmo) regex for the matching pattern to do the same thing (using its wildcards, quantifiers, anchors, escaped characters, etc.). What is the best way to do this?
One simple way that comes to mind is to use Dir.glob to get the list of all existing files in all directories, and then filter them using the regex, but that does not look efficient. Or, is it? Is there a better way?
You could try the Find module in Ruby's standard library.
require 'find'
# Enumerate every path under `path` recursively; grep keeps those matching the regex.
Find.find(path).grep(/regex/)
The find method recursively returns every path that exists beneath the path you provide as an argument, pretty much like what you mentioned with Dir.glob. You can then use the built-in grep method to filter the results with a regex.
This may not be the most efficient method, though, since Dir.glob is written in C while the Find module is written in Ruby. In a test on my home directory, Find took a little longer than Dir.glob to get the result. On the other hand, you can use Find.prune to avoid descending into particular folders, which can make the Find approach more efficient.

Regular Expression for Address/Zip/City&State

Anybody have an example of a regular expression that matches an address, a zip, or a [city, state]?
Update:
Admittedly, this is a weak question because I don't have enough information regarding user behavior at this point to really qualify the parameters of the problem. Here is what I'm trying to do though:
Create a search function that, depending on what information has been entered, chooses one of two divergent paths: the first an address-proximity search, the second an organization-name search.
It is proving a difficult problem to solve, so any input out there, besides .* (okay, okay, I deserved that), would be much appreciated.
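Without knowing the user behavior, one hedged starting point for the dispatch itself: a few heuristic patterns that say "this input looks like a location". A sketch assuming US-style input (the patterns are illustrative, not a validator):

import java.util.regex.Pattern;

public class SearchDispatch {
    private static final Pattern ZIP =
            Pattern.compile("\\b\\d{5}(?:-\\d{4})?\\b");          // 12345 or 12345-6789
    private static final Pattern CITY_STATE =
            Pattern.compile("(?i)\\b[a-z .'-]+,\\s*[a-z]{2}\\b"); // "Portland, OR"
    private static final Pattern STREET =
            Pattern.compile("(?i)^\\d+\\s+\\w+");                 // "123 Main ..."

    // True if the input should go down the address-proximity path.
    public static boolean looksLikeLocation(String input) {
        return ZIP.matcher(input).find()
                || CITY_STATE.matcher(input).find()
                || STREET.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeLocation("97205"));            // true
        System.out.println(looksLikeLocation("Portland, OR"));     // true
        System.out.println(looksLikeLocation("Acme Corporation")); // false
    }
}

Anything the heuristics miss falls through to the organization-name search, which is the safer default.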
Check out geocoder (http://www.rubygeocoder.com/). It will get lat/long from text input. What you could do for your search is first try to match organization names, and then try to match locations.
Luckily, Google figured out how to do proximity searches a while ago.

How to define aspell word delimiters?

Aspell treats words with underscores or dashes as two words, e.g. "cloud-based" is spell-checked as "cloud" and "based". Is there any way to specify the word delimiters so as to exclude dash and underscore?
If I understand the question correctly, Aspell cannot do exactly what you want (to my knowledge). This has to do with conditional compound-word treatment, which is on Aspell's TODO list.
The same list mentions that Hunspell does a better job with compound words, so it might be a viable alternative if you're not bound to Aspell.
OpenOffice uses Hunspell for spell checking, so it is easy to find out whether it fits your requirements. It does, at least, work for the "cloud-based" example, and it does NOT treat all hyphenated words as unconditional compounds, i.e. "based-cloud" would still be flagged as a spelling error.
Aspell is unable to do what you want at this point. The interface it uses for handling words with symbols in them is not sophisticated enough to handle such a case at this time. More information on this is listed here.
Sorry that this cannot be solved at this point, unless you want to implement your own interface. I would recommend using Hunspell, as Mikhail suggested.

to_tsquery() validation

I'm currently developing a website that allows searching a PostgreSQL database. The search works with to_tsquery(), and I'm trying to find a way to validate the input before it's sent as a query.
Other than that, I'm also trying to add a phrasing capability, so that if someone searches for HELLO | "I LIKE CATS" it will only find results with "hello" or with the entire phrase "i like cats" (as opposed to I & LIKE & CATS, which will find articles that have all three words regardless of where they appear).
Is there some reason why it's too expensive to let the DB server validate it? It does seem a bit excessive to duplicate the tsquery parsing logic in the client.
If the concern is that you don't want to run the whole query (which presumably involves table access) each time you validate, you can use the input in a smaller query. Sketched here in Python (assuming psycopg2, but the same idea works from any client library):
import psycopg2

def is_valid_query(conn, text_query):
    # Ask the server to parse the input; a malformed tsquery raises an error.
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT to_tsquery(%s)", (text_query,))
        return True
    except psycopg2.Error:
        conn.rollback()  # clear the aborted transaction before reusing the connection
        return False
With regard to phrasing, it's probably easiest to search with the non-phrased query first (using indexes), then filter the results down to those that actually contain the phrase. That could be done server side or client side. Depending on the language being parsed, it might be easiest to construct a simple regex of the phrase that deals with repeated whitespace or other ignorable symbols.
Search for to_tsquery('HELLO|(I&LIKE&CATS)'), getting back a list of documents which loosely match.
In the client, filter that to those matching the regex "HELLO|(I\s+LIKE\s+CATS)".
The downside is you do need some additional code for translating your query into the appropriate looser query, and then for translating it into a regex.
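For the simple grammar above (terms separated by |, phrases in double quotes), that translation is mechanical. A minimal sketch, in Java here only because the client language isn't specified; it assumes | never occurs inside a phrase and that the terms contain no other tsquery operators:

public class PhraseQuery {
    // Rewrite each quoted phrase, joining its words with `joiner`;
    // unquoted terms pass through untouched.
    static String translate(String input, String joiner) {
        StringBuilder out = new StringBuilder();
        for (String term : input.split("\\|")) {
            term = term.trim();
            if (out.length() > 0) out.append('|');
            if (term.startsWith("\"") && term.endsWith("\"") && term.length() > 1) {
                String[] words = term.substring(1, term.length() - 1).trim().split("\\s+");
                out.append('(').append(String.join(joiner, words)).append(')');
            } else {
                out.append(term);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String q = "HELLO | \"I LIKE CATS\"";
        System.out.println(translate(q, "&"));    // HELLO|(I&LIKE&CATS)  -> for to_tsquery
        System.out.println(translate(q, "\\s+")); // HELLO|(I\s+LIKE\s+CATS) -> the filter regex
    }
}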
Finally, there might be a technique in PostgreSQL to do proper phrase searching using the lexeme positions that are stored in tsvector values. I'm guessing that phrase searches are one of the intended uses, but I couldn't find an example of it in my cursory search. There's a section on it near the bottom of http://linuxgazette.net/164/sephton.html at least. (For what it's worth, PostgreSQL 9.6 and later have built-in phrase search via the <-> operator and phraseto_tsquery().)
