Manually specifying how to build an index? - ruby

I'm looking into Thinking Sphinx for its potential to solve an indexing problem. It looks like it has a very specific API for telling it which fields to index on a model. I don't like having this layer of abstraction in my way without being able to sidestep it. The thing is, I don't trust Sphinx to be able to interpret my model properly, as this model could have any conceivable property. Basically, I want to encode JSON in an RDBMS. In a way, I'm looking to make an RDBMS behave like MongoDB (RDBMSes have features I don't want to do without). If TS or some other index could be made to understand my models, this could work. Is it possible to manually provide key/value pairs to TS?
"person.name.first" => "John", "person.name.last" => "Doe", "person.age" => 32,
"person.address" => "123 Main St.", "person.kids" => ["Ed", "Harry"]
Is there another indexing tool that could be used from Ruby to index JSON?
(By the way, I have explored a wide variety of NoSQL databases. I am trying to address a very specific set of requirements.)

As Matchu has pointed out in the comments, Sphinx usually interacts directly with the database. This is why Thinking Sphinx is built like it is.
However, Sphinx (but not Thinking Sphinx) can also accept XML data formats, so if you want to go down that path, feel free. You're going to have to understand the underlying Sphinx structure much more deeply than you would with the normal relational database/ActiveRecord and Thinking Sphinx approach. Riddle may be useful for building a solution, but you'll still need to understand Sphinx itself first.
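To give a feel for that path, Sphinx's xmlpipe2 data source expects a stream of documents wrapped in a schema declaration. The following Ruby sketch prints such a stream for flattened key/value records; the field name and the way values are concatenated are illustrative assumptions, so check the Sphinx xmlpipe2 documentation for the exact element set before relying on it.

# Minimal sketch: emit an xmlpipe2-style stream for flattened key/value docs.
# Real data would need XML escaping; names here are placeholders.
docs = [
  { id: 1, fields: { "person.name.first" => "John", "person.name.last" => "Doe" } }
]

puts '<?xml version="1.0" encoding="utf-8"?>'
puts '<sphinx:docset>'
puts '  <sphinx:schema>'
puts '    <sphinx:field name="content"/>'
puts '  </sphinx:schema>'
docs.each do |doc|
  content = doc[:fields].map { |k, v| "#{k} #{v}" }.join(' ')
  puts "  <sphinx:document id=\"#{doc[:id]}\">"
  puts "    <content>#{content}</content>"
  puts "  </sphinx:document>"
end
puts '</sphinx:docset>'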

Basically, when you're specifying what you want to index--that is, when you want to build your own index--you're using the Map part of Map/Reduce. CouchDB supports exactly this. The only problem I ran into with Couch is that I wanted to query other document objects as the basis of my Map/Reduce, since those documents would contain metadata about how I want to build my indexes. This goes against the grain of Map/Reduce, however, as you have to map each document in isolation, with no external data. If you need external data, it has to be denormalized into your documents instead.
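To illustrate that isolation constraint: a map function only ever sees the single document it is handed, so any metadata driving the indexing has to live inside that document. A conceptual sketch in Ruby (CouchDB itself would normally run a view written in JavaScript; the names here are purely illustrative):

# Conceptual map step: each document is processed alone, with no lookups.
def map_document(doc)
  emitted = []
  # Any "index_on" metadata must already be denormalized into the document.
  (doc["index_on"] || []).each do |key|
    emitted << [key, doc[key]] if doc.key?(key)
  end
  emitted
end

map_document("type" => "dog", "name" => "joe", "index_on" => ["type", "name"])
# => [["type", "dog"], ["name", "joe"]]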

Related

Is there an OSM XAPI tag/value list?

I'm new to OSM querying, but I would like to query vector data for a large area, so I need to limit the results by filtering the request on tags.
http://www.informationfreeway.org/api/0.6/way[tag=value][bbox=x,y,z,j]
I'd like to filter for specific tags/values when querying for a way, but I don't know which tags and values exist. Is there a list of the most common ones?
You are approaching your problem from the wrong direction. The number of different tags is almost unlimited. According to taginfo, there are currently 75,380,856 different tags. I'm pretty sure you are not interested in most of them. Likewise, you are probably not even interested in many of the most common tags.
What data do you want to query?
The OSM wiki should be your starting point for generating a list of tags you are interested in. For a generic overview, take a look at the map features. Are you interested in streets? Then look at the highway key. Routing? Then take a look at the routing wiki page.
Always remember that these lists aren't complete. People can use any tag they like (but should use well-established tags whenever possible of course).
Also consider using Overpass API instead of XAPI. Overpass API is much more powerful.
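As a small illustration of the Overpass route, a query for one tag inside a bounding box could be sent from Ruby like this; the bounding-box coordinates and the highway=primary filter are placeholder values, so swap in the tags you actually care about:

require 'net/http'
require 'uri'

# Overpass QL: all ways tagged highway=primary inside a bounding box
# given as (south, west, north, east). The values below are placeholders.
query = '[out:json];way["highway"="primary"](50.60,7.00,50.65,7.10);out body;'

uri = URI('https://overpass-api.de/api/interpreter')
response = Net::HTTP.post_form(uri, 'data' => query)
puts response.body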

Elasticsearch schema for multiple versions of the same text

I'm working on a system that downloads articles from various news sites and performs various NLP analyses on the texts. I want to store multiple versions and aspects of each article, including
The raw HTML
A cleaned-up text-only version
CoreNLP output of the article.
Since I want to store the text-only version in Elasticsearch, I thought about storing everything else in Elasticsearch as well. I have no Elasticsearch experience, so I can't tell which is the better way to store these:
Have one record per article, with the HTML, text and CoreNLP outputs as properties of that article: {html: '....', text: '....', CoreNLP: '....'}
Store each type of information in its own type: /articles/html/1, /articles/text/1, /articles/corenlp/1, etc...
Which one is more common? Is there a third, better option?
Depends on where you want to do the CoreNLP processing, the HTML tidy-up, etc. If you want to do this in Elasticsearch, I would use the multi-field types:
https://www.elastic.co/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
If you do it outside of Elasticsearch (which would not be common, since this is a good task for Elasticsearch), you could use the multiple-fields approach.
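For what it's worth, the first option from the question would look roughly like this with the elasticsearch Ruby client; the index name, field names, and document ID are placeholders, and the client is assumed to point at a local node:

require 'elasticsearch'

client = Elasticsearch::Client.new   # assumes a node on localhost:9200

# One document per article, with each representation stored as its own field.
client.index(
  index: 'articles',
  id:    1,
  body:  {
    html:    '<html>...</html>',
    text:    'Cleaned-up text-only version of the article.',
    corenlp: '{"sentences": []}'
  }
)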

A simple filtering language that can be embedded in Ruby?

I have a Ruby project where part of the operation is to select entities given user-specified constraints. So far, I've been hacking together my own filter language, using regular expressions and specifying inclusion/exclusion based on the fields in the entities.
If you are interested in my current approach, here's an example. Given this list of entities:
[{"type":"dog", "name":"joe"}, {"type":"dog", "name":"fuzz"}, {"type":"cat", "name":"meow"}]
A user could specify a filter like so:
{"filter":{
"type":{"included":["dog"] },
"name":{"excluded":["^f.*"] }
}}
This filter would match all dogs but exclude fuzz.
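For reference, the inclusion/exclusion scheme described above fits in a dozen lines of Ruby. This sketch just restates the approach from the question (the hash layout and method name are mine), it is not a recommendation:

# Apply a {"field" => {"included" => [...], "excluded" => [...]}} filter
# to an array of entity hashes, treating each pattern as a regexp.
def apply_filter(entities, filter)
  entities.select do |entity|
    filter.all? do |field, rules|
      value    = entity[field].to_s
      included = rules.fetch("included", []).any? { |p| value =~ Regexp.new(p) }
      excluded = rules.fetch("excluded", []).any? { |p| value =~ Regexp.new(p) }
      (rules["included"].nil? || included) && !excluded
    end
  end
end

entities = [{"type" => "dog", "name" => "joe"},
            {"type" => "dog", "name" => "fuzz"},
            {"type" => "cat", "name" => "meow"}]
filter = {"type" => {"included" => ["dog"]}, "name" => {"excluded" => ["^f.*"]}}
apply_filter(entities, filter)
# => [{"type"=>"dog", "name"=>"joe"}]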
This is sort of working now. However, I am starting to require more sophisticated selection parameters. I am thinking that rather than continuing to hack on my own filter language, there might be a more general-purpose filter language I can just embed in my application. For instance, is there a parser that can filter in-app using a SQL WHERE clause? Or are there other general, simple filter languages that I'm not aware of? I would especially like to move away from regexps, since I want to do range queries on numbers (e.g., is entity["size"] < 50?).
It is a little bit of an extrapolation, but I think you may be looking for a search engine, or at least enough of one that you may as well use one just for the query language.
If so, you might want to look at Elasticsearch, which has Ruby client bindings and could be a good fit for what you are trying to do, especially if you want or need to express the data you want to search as JSON for use by client code, since that format is natively supported by the search engine.
The query language is quite expressive, and there are a variety of built-in and plugin tools available to explore and use it.
In the end, I ended up implementing a Ruby DSL. It's easy, fun, and powerful.
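Purely as an illustration of what such a DSL might look like (this is not the poster's actual code; the class and method names are hypothetical), a few lines built on instance_eval go a long way:

# Hypothetical sketch of a small filtering DSL built on instance_eval.
class Filter
  def self.build(&block)
    new.tap { |f| f.instance_eval(&block) }
  end

  def initialize
    @tests = []
  end

  def keep(&test)
    @tests << test
  end

  def apply(entities)
    entities.select { |e| @tests.all? { |t| t.call(e) } }
  end
end

filter = Filter.build do
  keep { |e| e["type"] == "dog" }
  keep { |e| e["size"].to_i < 50 }   # numeric range check, no regexp needed
end
filter.apply([{"type" => "dog", "size" => "10"}, {"type" => "cat", "size" => "5"}])
# => [{"type"=>"dog", "size"=>"10"}]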

Classify documents with tags

I have a huge number of documents (mainly PDFs and DOCs) that I want to classify, so I can search over them according to certain tags. These tags could either be my own (I assign the tags to the documents) or extracted from the text.
I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something even simpler.
Mahout might be overkill for your problem - but you can get a fairly quick, easy solution by using OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you desire. If they are really distinct, you can get away with a small sample size.
You can use the DocumentCategorizerME.train() static function to train a model on a collection of documents, where each document requires a category tag and the text block to train on. Then, you can initialize the DocumentCategorizerME with the trained model and begin classifying all the rest of your documents.
Once you do this, you can (I think) write the model to a file so you don't have to ever do that again.
This post on extracting keywords and classifying webpages is related and may be helpful. In your example it sounds like you can use tags in lieu of the keyword extraction piece (although you may want to use both in combination). Weka is easy to use, I would definitely recommend giving it a look.

Lightweight fuzzy search library

Can you suggest a lightweight fuzzy text search library?
What I want to do is to allow users to find correct data for search terms with typos.
I could use a full-text search engine like Lucene, but I think that's overkill.
Edit:
To make question more clear here is a main scenario for that library:
I have a large list of strings. I want to be able to search in this list (something like MSVS's IntelliSense), but it should also be possible to filter the list by a string that is not present in it but is close enough to some string that is.
Example:
Red
Green
Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
Main language for indexed data will be English.
I think that Lucene is too heavy for that task.
Update:
I found one product matching my requirements. It's ShuffleText.
Do you know any alternatives?
Lucene is very scalable, which means it's good for little applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide what algorithm you'd like to use. With information retrieval, I use an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data? How often is the data updated?
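To make the n-gram idea concrete, here is a plain-Ruby illustration of the technique (not Lucene's API): break each candidate into overlapping trigrams and rank candidates by how many trigrams they share with the query.

# Trigram-overlap fuzzy match: a bare-bones illustration of n-gram matching.
def trigrams(str)
  s = str.downcase
  (0..s.length - 3).map { |i| s[i, 3] }
end

def best_match(query, candidates)
  query_grams = trigrams(query)
  candidates.max_by { |c| (trigrams(c) & query_grams).size }
end

best_match("Gren", ["Red", "Green", "Blue"])
# => "Green"  (shares the trigram "gre" with the query; the others share none)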
I'm not sure how well Lucene is suited for fuzzy searching; a custom library might be a better choice. For example, this search is done in Java and works pretty fast, but it is custom-made for such a task:
http://www.softcorporation.com/products/people/
Soundex is very 'English' in its encoding - Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
You didn't specify your development platform, but if it's PHP then I suggest you look at the Zend Lucene library:
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
As it's LAMP, it's far lighter than Lucene on Java, and it can easily be extended for other file types, provided you can find a conversion library or command-line converter - there are lots of OSS solutions around to do this.
Try Walnutil, which is based on the Lucene API and integrates with SQL Server and Oracle databases. You can create any type of index and then use it. For simple searches you can use methods from WalnutilSoft; for more complicated cases you can use the Lucene API. See the web-based example that uses indexes created with the Walnutil tools. You can also see some code examples written in Java and C# that you can use to create different types of searches.
These tools are free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend using PostgreSQL and its fuzzy string matching functions.
If you can use Ruby, I suggest looking into the amatch library.
#aku - links to working soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on that also has implementations listed at the bottom.
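For the example in the question, a plain-Ruby Levenshtein distance is already enough to pick 'Green' for the query 'Gren'. This is the textbook dynamic-programming version, not tied to any particular library:

# Textbook Levenshtein edit distance (insert/delete/substitute, each cost 1).
def levenshtein(a, b)
  a = a.downcase
  b = b.downcase
  rows = Array.new(a.length + 1) { |i| [i] + Array.new(b.length, 0) }
  (0..b.length).each { |j| rows[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      rows[i][j] = [rows[i - 1][j] + 1,
                    rows[i][j - 1] + 1,
                    rows[i - 1][j - 1] + cost].min
    end
  end
  rows[a.length][b.length]
end

["Red", "Green", "Blue"].min_by { |word| levenshtein("Gren", word) }
# => "Green"  (distance 1, versus 2 for "Red" and 4 for "Blue")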
A powerful, lightweight solution is Sphinx.
It's smaller than Lucene and it supports disambiguation.
It's written in C++, it's fast and battle-tested, has libraries for every environment, and it's used by large companies like craigslist.org.

Resources