Need a tool to search large structured text documents for words, phrases and related phrases

I have to keep up with structured documents such as requests for proposals, government program reports, threat models and the like. They are written in what I would call techno-legalese: highly structured, with section numbering and 3, 4 or 5 levels of nesting. All are in English.
I need a more efficient way to locate the paragraphs or nuggets that matter to me. What I’d like is a kind of local document index/repository that would let me keep some standing queries and easily locate the sections of documents that relate to them. Here’s an example:
I’d like to load in 10 large PDF files, each of say 100 pages. Each PDF contains English text, formatted very nicely into paragraphs and sections.
I’d like to specify that I am interested in “blogging platforms”, “weaknesses in Ruby”, “localization and internationalization”
Ideally, I would then see a list showing each matching section of text, the name of its document, and other information that includes or seems related to the words and phrases I specified.
I am sure something like this exists. I would call it something like document indexing, document comprehension or structured searching.

Take a look at Lucene and Solr, which can do most of what you ask. They are not exactly featherweight, though!
There is also this excellent book:

OpenGrok is another lightweight solution built on top of Lucene.
Alternatively, you could have a look at,
which is not a lightweight solution, but it is designed exactly for your purposes.
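To make the idea of standing queries over numbered sections concrete, here is a minimal Python sketch. It is not how Lucene or Solr work internally. It assumes the PDFs have already been converted to plain text, that headings look like "3.2.1 Title" on their own line, and it only does exact, case-insensitive phrase matching (no "related phrases"). The names `split_sections` and `run_standing_queries` are made up for illustration.

```python
import re

# Matches headings such as "3", "3.2" or "3.2.1" followed by a title.
# This numbering style is an assumption about the documents; a body line
# that happens to start with a number would be mistaken for a heading.
HEADING = re.compile(r"^(\d+(?:\.\d+){0,4})\s+(\S.*)$")


def split_sections(text):
    """Split plain text into (number, title, body) tuples at numbered headings."""
    sections = []
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = [match.group(1), match.group(2), []]
            sections.append(current)
        elif current is not None:
            current[2].append(line)
    return [(num, title, "\n".join(body).strip()) for num, title, body in sections]


def run_standing_queries(documents, queries):
    """Return (document name, section number, title, matched queries) rows."""
    hits = []
    for name, text in documents.items():
        for num, title, body in split_sections(text):
            haystack = (title + "\n" + body).lower()
            matched = [q for q in queries if q.lower() in haystack]
            if matched:
                hits.append((name, num, title, matched))
    return hits
```

Calling `run_standing_queries({"rfp.txt": text}, ["blogging platforms", "weaknesses in Ruby"])` returns one row per section that contains at least one of the phrases. A real index adds stemming, ranking and fuzzy matching, which is where the tools above earn their weight.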


Tiny Elasticsearch-like search library in Ruby?

Is there an alternative to Elasticsearch? I am looking for something lightweight and tiny; some sort of Ruby library would be ideal. For example, I would rather not run a separate application to do the job. Instead, I'd like to use the library to search YAML or JSON files for matching text and get back a list sorted by match score.
An example should make it clear.
Let's say I have a single YAML file with an array of strings, e.g.:
- "some nice text here"
- "this is another sentence"
- "some nice"
- "nice text here"
So now if I search for “some nice text here” the first line should come on top but also there is a slight similarity with the last two lines, so the search results should return as follows:
some nice text here
nice text here
some nice
Sphinx and Solr come to mind, but they're not small. Hell, even using PostgreSQL is fine until you really need to scale (and you'll know when you need to scale), and it's probably already baked into your stack.
If all you need is to search YAML files, you could easily write your own naive approach.
There's also but it might require some updating.
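As a rough illustration of such a naive approach, here is a sketch that ranks strings by how many distinct query words they share. It is written in Python rather than Ruby purely for illustration, it assumes the YAML array has already been loaded into a list of strings, and the function names are invented. Word overlap is only one possible score; it is not what Elasticsearch computes.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def score(query, candidate):
    """Count how many distinct query words also occur in the candidate."""
    return len(set(tokenize(query)) & set(tokenize(candidate)))


def search(query, candidates):
    """Return candidates with at least one shared word, best match first."""
    scored = [(score(query, c), c) for c in candidates]
    matches = [(s, c) for s, c in scored if s > 0]
    # list.sort is stable, so ties keep their original file order.
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in matches]


lines = [
    "some nice text here",
    "this is another sentence",
    "some nice",
    "nice text here",
]
results = search("some nice text here", lines)
# results == ["some nice text here", "nice text here", "some nice"]
```

This reproduces the ordering asked for in the question: the exact match shares four words, "nice text here" three, "some nice" two, and the unrelated sentence is dropped.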

Is there an OSM XAPI tag/value list?

I'm new to OSM querying, but I would like to query vector data for a large area. I therefore need to limit the results by adding tag predicates to the request: [tag=value][bbox=x,y,z,j]
I'd like to filter for specific tags/values when querying for ways, but I don't know which tags/values exist. Is there a list of the most common ones?
You are approaching your problem from the wrong direction. The number of different tags is almost unlimited. According to taginfo there are currently 75 380 856 different tags. I'm pretty sure you are not interested in most of them. Likewise you are probably not even interested in many of the most common tags.
What data do you want to query?
The OSM wiki should be your starting point for generating a list of tags you are interested in. For a generic overview take a look at the map features. Are you interested in streets? Then look at the highway key. Routing? Then take a look at the routing wiki page.
Always remember that these lists aren't complete. People can use any tag they like (but should use well-established tags whenever possible of course).
Also consider using Overpass API instead of XAPI. Overpass API is much more powerful.
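For comparison with the XAPI-style [tag=value][bbox=...] filter, here is a small Python sketch that builds the equivalent Overpass QL query and request URL without sending it. The tag highway=primary and the coordinates are placeholder values, the helper names are made up, and the 60-second timeout is an arbitrary choice. Note that Overpass expects the bounding box as south, west, north, east, whereas XAPI uses left, bottom, right, top.

```python
from urllib.parse import urlencode

# A public Overpass endpoint; other instances exist.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_way_query(key, value, south, west, north, east):
    """Build an Overpass QL query for ways with key=value inside a bounding box."""
    bbox = f"{south},{west},{north},{east}"
    return f'[out:json][timeout:60];way["{key}"="{value}"]({bbox});out geom;'


def build_request_url(query):
    """Encode the query as the `data` parameter of a GET request."""
    return OVERPASS_URL + "?" + urlencode({"data": query})


# Placeholder tag and bounding box, chosen only for illustration.
query = build_way_query("highway", "primary", 52.4, 13.3, 52.6, 13.5)
url = build_request_url(query)
```

Fetching `url` with any HTTP client should return the matching ways as JSON; keep the bounding box small while experimenting so the public server is not overloaded.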

html and javascript tokenizer for lucene elasticsearch

I would like to tokenize HTML (not parse!) and JavaScript so that I can search the source code we produce.
For example, a query of:
would return documents containing that.
Does anyone have information regarding a code tokenizer?
I'm not aware of any code analyzers, per se. Generally, if you want to index web content, you'd use some other library to parse it and extract content, then analyze and index the extracted content.
But if you are looking for the shotgun approach of just chucking a bunch of raw code right into the index, analysis won't be perfect no matter how you go about it; it's an approximate effort anyway. I'd probably go with PatternAnalyzer for a first pass. I wouldn't even change the defaults. The default pattern, \W+, splits on runs of non-word characters, so your tokens will be consecutive sequences of letters, numbers and underscores, which is about what is generally used for identifiers.
So, if you have:
<script src="jquery-1.11.min.js"></script>
You wind up with tokens like: script, src, jquery, 1, 11, min, js, script
This would be most useful for searching phrases, probably. So, for the search you indicated, you'll be using the same analysis, and be searching for a phrase of five consecutive terms: jquery, 1, 11, min, js. Which seems reasonable enough.
There are some significant weaknesses, of course. It would be impossible to differentiate between 2.11, 2*11 and 2+11, for instance. Something to bear in mind.
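The effect of splitting on \W+ can be tried out without Lucene. The following Python sketch mimics the tokenization and the phrase match described above; it is not PatternAnalyzer itself, and the lowercasing step is an assumption about the analyzer's configuration.

```python
import re

# Same pattern as the \W+ default discussed above: split on runs of non-word characters.
SPLIT = re.compile(r"\W+")


def tokenize(source):
    """Lowercase the input and split it on non-word characters, dropping empties."""
    return [tok for tok in SPLIT.split(source.lower()) if tok]


def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur consecutively somewhere in doc_tokens."""
    n = len(phrase_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == phrase_tokens for i in range(len(doc_tokens) - n + 1)
    )


doc = tokenize('<script src="jquery-1.11.min.js"></script>')
# doc == ["script", "src", "jquery", "1", "11", "min", "js", "script"]
query = tokenize("jquery-1.11.min.js")
found = contains_phrase(doc, query)  # True: five consecutive terms match
```

The weakness mentioned above shows up directly: `tokenize("2.11")`, `tokenize("2*11")` and `tokenize("2+11")` all give `["2", "11"]`, so the three cannot be told apart at search time.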

Bing/Google/Flickr API: how would you find an image to go along each of 150,000 Japanese sentences?

I'm doing a part-of-speech and morphological analysis project for Japanese sentences. Each sentence will have its own webpage. To make this page more visual, I want to show one picture that is somehow related to the sentence. For example, for the sentence "私は学生です" ("I'm a student"), the relevant pictures would be pictures of a school, a Japanese textbook, students, etc. What I have: part-of-speech tagging for every word. My approach now: take 2-3 nouns from every sentence and retrieve the first image from the search results using the Bing Images API. Note: all the sentence processing up to this point was done in Java.
I have a couple of questions, though:
1) Which is better (richer corpus and more powerful search) for searching nouns in Japanese: the Google Images API, the Bing Images API, the Flickr API, etc.?
2) How do you select the most important noun in a sentence to use as the image search query, without doing complicated topic modeling, etc.?
Japanese WordNet has links to OpenClipart pictures. That could be another relevant source. They describe it in their paper called "Enhancing the Japanese WordNet".
I thought you would start by choosing any noun before は、が and を and giving these priority - probably in that order.
But that assumes that your part-of-speech tagging is good enough to get は=subject identified properly (as I guess you know that は is not always the subject marker).
I looked at a bunch of sample sentences here with this technique in mind and found it as good as could be expected. Except where none of those are used, which is rarish.
And sentences like this one, where you'd have to consider maybe looking for で and a noun before it in the case where there is no を or は. Because if you notice here, the word 人 (people) really doesn't tell you anything about what's being said. Without parsing context properly, you don't even know if the noun is person or people.
毎年 交通事故で 多くの人が 死にます
(many people die in traffic accidents every year)
But basically, couldn't you implement a priority/fallback type system like this?
BTW I hope your sentences all use kanji, or when you see はし (in one of the sentences linked to) you won't know whether to show a bridge or chopsticks - and showing the wrong one will probably not be good.
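A priority/fallback system of this kind might be sketched as follows. This is Python for illustration only (the question's pipeline is in Java), the tag names "noun" and "particle" are hypothetical stand-ins for whatever the tagger emits, and the particle order is a guess to be tuned against real sentences.

```python
# Particles in the order they are tried; the order itself is an assumption to tune.
DEFAULT_PRIORITY = ("は", "が", "を", "で")


def pick_noun(tokens, priority=DEFAULT_PRIORITY):
    """Return the noun directly before the highest-priority particle, or None.

    tokens is a list of (surface, pos) pairs, where pos is "noun",
    "particle" or anything else (hypothetical tag names).
    """
    for particle in priority:
        for i in range(1, len(tokens)):
            surface, pos = tokens[i]
            prev_surface, prev_pos = tokens[i - 1]
            if pos == "particle" and surface == particle and prev_pos == "noun":
                return prev_surface
    return None


student = [("私", "noun"), ("は", "particle"), ("学生", "noun"), ("です", "aux")]
accident = [
    ("毎年", "noun"), ("交通事故", "noun"), ("で", "particle"),
    ("多く", "noun"), ("の", "particle"), ("人", "noun"), ("が", "particle"),
    ("死に", "verb"), ("ます", "aux"),
]
```

With the default order, `pick_noun(accident)` returns 人, which is exactly the uninformative result described above. Passing `priority=("は", "を", "で", "が")` makes で outrank が and returns 交通事故 instead, so the order is worth experimenting with on a sample of the 150,000 sentences.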

Classify documents with tags

I have a huge number of documents (mainly PDFs and DOC files) that I want to classify so I can search them by tag. The tags could either be my own (assigned to each document by hand) or be extracted from the text.
I've just seen a related post (Classify data using Apache Mahout), but perhaps there is something even simpler.
Mahout might be overkill for your problem - but you can get a fairly quick, easy solution by using OpenNLP.
Specifically, look at the package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you desire. If they are really distinct, you can get away with a small sample size.
You can use the static DocumentCategorizerME.train() function to train a model on a collection of documents, where each document needs a category tag and the text block to train on. Then you can initialize a DocumentCategorizerME with the trained model and begin classifying the rest of your documents.
Once you do this, you can (I think) write the model to a file so you don't have to ever do that again.
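The workflow described here (hand-tag a small sample, train, classify the rest, save the model) can be illustrated with a toy classifier. The Python sketch below is not OpenNLP's API and does not use its algorithm: it swaps in a plain multinomial naive Bayes with Laplace smoothing, and the sample texts and category names are invented.

```python
import json
import math
from collections import Counter, defaultdict


def train(samples):
    """Train on (category, text) pairs; returns a JSON-serializable model."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in samples:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocab = sorted({w for counts in word_counts.values() for w in counts})
    return {
        "docs": dict(doc_counts),
        "words": {c: dict(counts) for c, counts in word_counts.items()},
        "vocab": vocab,
    }


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    best_category, best_score = None, -math.inf
    for category, n_docs in model["docs"].items():
        counts = model["words"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            score += math.log((counts.get(word, 0) + 1) / (total_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Invented, hand-tagged training sample.
samples = [
    ("security", "threat model for the login service"),
    ("security", "attack surface and threat analysis"),
    ("finance", "quarterly budget and cost report"),
    ("finance", "budget forecast and invoice totals"),
]
model = train(samples)
# Stand-in for writing the model to a file and reading it back later.
restored = json.loads(json.dumps(model))
```

Because the model is plain dictionaries and lists, it can be dumped to a JSON file once and reloaded later, which mirrors the "train once, reuse the model" step. Text would first have to be extracted from the PDF and DOC files, which this sketch leaves out.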
This post on extracting keywords and classifying webpages is related and may be helpful. In your example it sounds like you can use tags in lieu of the keyword extraction piece (although you may want to use both in combination). Weka is easy to use; I would definitely recommend giving it a look.
