Tiny Elasticsearch-like search library in Ruby? - ruby

Is there an alternative to Elasticsearch? I am looking for something lightweight and tiny; some sort of Ruby library would be ideal. For example, instead of running a separate application to do the job, I would like to use the library to search YAML or JSON files for matching text and get back a list sorted by match score.
An example should make this clear.
Let's say I have a single YAML file with an array of strings, e.g.:
values:
  - "some nice text here"
  - "this is another sentence"
  - "some nice"
  - "nice text here"
So now if I search for “some nice text here”, the first line should come out on top, but the last two lines are also somewhat similar, so the search results should be returned as follows:
some nice text here
nice text here
some nice
Thanks

Sphinx and Solr come to mind, but they're not small. Hell, even using PostgreSQL until you really need to scale is fine (and you'll know when you need to scale), and it's probably already baked into your stack.
If your needs are to search YAML files, you could easily write your own naive approach.
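For what it's worth, a naive approach could be sketched roughly like this. It is only an illustration, assuming that "similarity" means word overlap (Jaccard similarity on lowercase word sets); the helper names tokens, score and search are made up for this example and do not come from any library.

```ruby
require "yaml"

# Hypothetical helper: split a string into its distinct lowercase words.
def tokens(text)
  text.downcase.scan(/\w+/).uniq
end

# Jaccard similarity: shared words divided by all distinct words.
def score(query, candidate)
  q = tokens(query)
  c = tokens(candidate)
  union = (q | c).size
  union.zero? ? 0.0 : (q & c).size.to_f / union
end

# Keep candidates that share at least one word, best match first.
def search(query, candidates)
  candidates
    .map { |candidate| [candidate, score(query, candidate)] }
    .reject { |_, s| s.zero? }
    .sort_by { |_, s| -s }
    .map(&:first)
end

doc = YAML.safe_load(<<~YAML)
  values:
    - "some nice text here"
    - "this is another sentence"
    - "some nice"
    - "nice text here"
YAML

puts search("some nice text here", doc["values"])
# prints:
#   some nice text here
#   nice text here
#   some nice
```

Word overlap ignores word order, typos and stemming, so it is only a starting point; for anything beyond a handful of files a real index is probably the better choice.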
There's also https://github.com/mezis/fuzzily but it might require some updating.

Related

How do file converters work in general, like Word to PDF, XML to JSON, Word to TXT, etc.?

I've used many types of file converters, like Word to PDF, XML to JSON, Word to TXT, etc.
How do they work in the backend? Are there specific guidelines that each of them follows? Are there similarities in the way they are implemented?
I tried searching for it, but most of the articles take me to a web app that can convert the document, and none of them gives any clarity on how it's done.
All of them work by parsing the source document into a data structure, then generating a document in the other format from that data structure using recursion.
Parsing itself is a giant topic that people take courses on in computer science. But long story short, it proceeds by breaking the document into tokens, and then fitting the tokens into a parse tree using one of a standard set of methods. They have all sorts of fancy names like Recursive Descent and LALR(1). That's where most of the theory you'd want to learn is.
For example, if you're writing a JSON to XML converter, you'd first need to parse that JSON. A JSON Parser shows how you could write that, from scratch, using recursive descent. Once that is written, you just need to write a recursive function that takes each data type and does something appropriate with it to generate text in the format that you want.
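As a rough sketch of that second step, here is what such a recursive function might look like in Ruby. It leans on Ruby's bundled json library for the parsing instead of a hand-written recursive descent parser, and the XML layout is an assumption made for illustration: hash keys are taken to be valid element names, array entries are wrapped in a hypothetical <item> element, and xml_escape and to_xml are made-up helper names.

```ruby
require "json"

# Escape the characters that are special in XML text.
def xml_escape(text)
  text.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

# Recursively turn a parsed JSON value into XML text.
# Assumption: hash keys are valid XML element names, and array
# entries are wrapped in a hypothetical <item> element.
def to_xml(value, name)
  case value
  when Hash
    inner = value.map { |key, child| to_xml(child, key) }.join
    "<#{name}>#{inner}</#{name}>"
  when Array
    inner = value.map { |child| to_xml(child, "item") }.join
    "<#{name}>#{inner}</#{name}>"
  else
    # Strings, numbers, booleans and null end up as escaped text.
    "<#{name}>#{xml_escape(value)}</#{name}>"
  end
end

data = JSON.parse('{"title": "Tom & Jerry", "tags": ["cat", "mouse"], "year": 1940}')
puts to_xml(data, "root")
# prints:
#   <root><title>Tom &amp; Jerry</title><tags><item>cat</item><item>mouse</item></tags><year>1940</year></root>
```

Each branch of the case statement handles one data type from the parse result, which is the shape most converters share regardless of the formats involved.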
Incidentally you can also write a "document converter" that converts from a document format to the same document format. Why would someone want to do that? The two most common use cases are to prettify or minify code. Despite the fact that only one format is being dealt with, the principles of how you do it are exactly the same.

Partial Indexing of an XML file (Bleve)

I am evaluating a couple different libraries to see which one will best fit what I need.
Right now I am looking at Bleve, but I am happy to use any library.
I am looking to index full files except specific ones which are in XML format. For those I only want Bleve to index specific tags as most of the tags are worthless to search. I am trying to evaluate if this is possible but, being new to Bleve, I am not sure what part I need to customize.
The documentation is very good, but I can't seem to find this answer. All I need is an explanation with keywords and steps; no code is required. I just need a push, as I have spent hours spinning my wheels with Google searches and am getting nowhere.
There are probably many ways to approach this. Here's one.
Bleve indexes documents which are collections of key/value metadata pairs.
In your case, a document could be represented by 2 key/value pairs: name of .xml file (to uniquely identify the document) and content of the file.
type Doc struct {
	Name string
	Body string
}
The issue is that the body is XML, and Bleve doesn't support XML out of the box.
A way to address this is to pre-process the XML file by stripping the unwanted tags and content. You can do that with Go's standard encoding/xml package.
For an example of a similar task you can see the code of https://github.com/blevesearch/fosdem-search/
In there, they index a file in a custom format (https://github.com/blevesearch/fosdem-search/blob/master/fosdem.ical) by parsing it into a format they can submit to Bleve for indexing (https://github.com/blevesearch/fosdem-search/blob/master/ical.go).

to_tsquery() validation

I'm currently developing a website that allows searching a PostgreSQL database. The search works with to_tsquery(), and I'm trying to find a way to validate the input before it is sent as a query.
I'm also trying to add phrase support, so that if someone searches for HELLO | "I LIKE CATS" it will only find results with "hello" or the entire phrase "i like cats" (as opposed to I & LIKE & CATS, which will find articles that contain all 3 words regardless of where they appear).
Is there some reason why it's too expensive to let the DB server validate it? It does seem a bit excessive to duplicate the tsquery parsing algorithm in the client.
If the concern is that you don't want to run the whole query (which presumably involves table access) each time you validate, you could pass the input to a much smaller query. In pseudocode (which may look a bit like Python, but that's just coincidence):
is_valid_query(input):
    try:
        # Only parses the input; no table access is involved.
        execute("SELECT to_tsquery($1)", input)
        return True
    except DatabaseError:
        return False
With regard to phrasing, it's probably easiest to search by the non-phrased query first (using indexes), then filter those for having the phrase. That could be done server side or client side. Depending on the language being parsed, it might be easiest to construct a simple regex of the phrase that deals with repeated whitespace or other ignorable symbols.
Search for to_tsquery('HELLO|(I&LIKE&CATS)'), getting back a list of documents which loosely match.
In the client, filter that to those matching the regex "HELLO|(I\s+LIKE\s+CATS)".
The downside is you do need some additional code for translating your query into the appropriate looser query, and then for translating it into a regex.
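That additional code could be sketched roughly as follows, in Ruby. This is only an illustration under some loud assumptions: alternatives are separated by |, phrases are wrapped in double quotes, and every word is plain alphanumeric text (nothing here sanitizes tsquery operators). The helper names parse_alternatives, loose_tsquery and phrase_regex are invented for this example.

```ruby
# Split a query such as: HELLO | "I LIKE CATS"
# into alternatives; each alternative is a list of words.
def parse_alternatives(query)
  query.split("|").map { |part| part.delete('"').split }.reject(&:empty?)
end

# Looser query for to_tsquery: the words of a phrase are ANDed together.
def loose_tsquery(query)
  parse_alternatives(query).map { |words|
    words.size > 1 ? "(#{words.join('&')})" : words.first
  }.join("|")
end

# Regex for the client-side pass: the words of a phrase must be
# adjacent, separated only by whitespace.
def phrase_regex(query)
  pattern = parse_alternatives(query).map { |words|
    words.map { |word| Regexp.escape(word) }.join('\s+')
  }.join("|")
  Regexp.new("\\b(?:#{pattern})\\b", Regexp::IGNORECASE)
end

query = 'HELLO | "I LIKE CATS"'
puts loose_tsquery(query)
# prints: HELLO|(I&LIKE&CATS)
puts phrase_regex(query).match?("Well, i  like cats a lot")
# prints: true
```

Keep in mind that to_tsquery normalizes words (stemming, stop words) while the regex compares literal text, so the two passes will not always agree on word variants.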
Finally, there might be a technique in PostgreSQL to do proper phrase searching using the lexeme positions that are stored in tsvectors. I'm guessing that phrase searches are one of the intended uses, but I couldn't find an example in my cursory search. There's a section on it near the bottom of http://linuxgazette.net/164/sephton.html at least.

Classify documents with tags

I have a huge number of documents (mainly PDFs and DOCs) that I want to classify, so that I can search them by certain tags. These tags could either be my own (I assign the tags to the document) or be extracted from the text.
I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something even more simple.
Mahout might be overkill for your problem - but you can get a fairly quick, easy solution by using OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you desire. If they are really distinct, you can get away with a small sample size.
You can use the DocumentCategorizerME.train() static function to train a model on a collection of documents, where each document requires a category tag and the text block to train on. Then, you can initialize the DocumentCategorizerME with the trained model and begin classifying all the rest of your documents.
Once you do this, you can (I think) write the model to a file so you don't have to ever do that again.
This post on extracting keywords and classifying webpages is related and may be helpful. In your example, it sounds like you can use tags in lieu of the keyword extraction piece (although you may want to use both in combination). Weka is easy to use; I would definitely recommend giving it a look.

Need a tool to search large structure text documents for words, phrases and related phrases

I have to keep up with structured documents such as requests for proposals, government program reports, threat models and all kinds of things like that. They are written in what I would call techno-legalese: highly structured, with section numbering and 3, 4 and 5 levels of nesting. All are in English.
I need a more efficient way to locate those paragraphs of nuggets that matter to me. So what I’d like is kind of a local document index/repository, that would allow me to have some standing queries and easily locate sections in documents that talk about my queries. Here’s an example:
I’d like to load in 10 large PDF files, each of say 100 pages. Each PDF contains English text, formatted very nicely into paragraphs and sections.
I’d like to specify that I am interested in “blogging platforms”, “weaknesses in Ruby”, “localization and internationalization”
Ideally then look at a list that showed the section of text, the name of the document, and other information that seemed to be related to and/or include the words and phrases I specified.
I am sure something like this exists. I would call it something like document indexing, document comprehension or structured searching.
Take a look at Lucene: http://lucene.apache.org/ and Solr http://lucene.apache.org/solr/ , which can do most of what you ask. They are not exactly featherweight though!
There is also this excellent book:
http://www.amazon.com/Building-Search-Applications-Lucene-Lingpipe/dp/0615204252/
OpenGrok is another lightweight solution on top of Lucene: http://opengrok.github.io/OpenGrok/
Alternatively, you could have a look at http://www.alfresco.com, which is not a lightweight solution but is designed for exactly your purposes.
