Elasticsearch schema for multiple versions of the same text - elasticsearch

I'm working on a system that downloads articles from various news sites and performs various NLP analyses on the texts. I want to store multiple versions and aspects of each article, including
The raw HTML
A cleaned-up text-only version
CoreNLP output of the article.
Since I want to store the text-only version on Elasticsearch, I thought about storing everything else on Elasticsearch, as well. I have no Elasticsearch experience, so I can't tell what's a better way to store these:
Have one record per article, with the HTML, text and CoreNLP outputs as properties of that article : {html: '....', text: '....', CoreNLP: '....'}
Store each type of information in its own type: /articles/html/1, /articles/text/1, /articles/corenlp/1, etc...
Which one is more common? Is there a third, better option?

Depends on where you want to do the COreNLP, the html tidy up, etc. If you want to do this in elastic I would use the multi field types:
https://www.elastic.co/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
If you do it outside of elastic, which would not be common since this is a good task for elastic, you could use the multiple fields approach.

Related

ElasticSearch multiple languages

I am currently trying to figure out analysis schemes for my ElasticSearch cluster. I am using ES to index pdf, word, powerpoint and excel documents. I am using Apache Tika to extract the text.
My problem is that I do not know before hand what languages to expect the file contents to be. They could be written in any language.
My question is, is there a way to make ES analyze text regardless of the language? Or should I have a pre-defined field for each language with its own tokenizer, analyzer and stopwords?
I suggest taking a look at the ElasticSearch plugin elasticsearch-mapper-attachments. I used it to build document search functionality.
When it comes to supporting multiple languages, we have had the best experience with one index per language. If you can identify the language before indexing you can insert the document into the appropriate index. This makes it easier to add new languages vs. a field per language approach.
One thing to remember is the Don't use Types for Languages note at the bottom of one language per document page. Doing that can mess up search in a very difficult to debug way.
If you need to detect the language, there are two options mentioned at the bottom of the Pitfalls of Mixing Languages page.

Skip common/duplicate parts while indexing web pages with ElasticSearch

I don't have any experience with ElasticSearch yet, but from what I read I think it suits most my needs. I have a web scraper which scrapes pages of certain domains.
I want to feed these pages into SE and offer a front end interface to search the scraped content. I'm building some sort of vertical search engine.
But as we all know, web pages of one host often only contain a little bit of unique content, a great part of the pages are common. Footer, header, menu etc. are the same on every page.
Does ElasticSearch have some build in intelligence that can filter out the common parts and only search the real content??
It's not terribly difficult to pump web content into Elastic, so I'll assume you have that down. =)
I think this article is fantastic for understanding how to index/search web pages:
http://blog.urx.com/urx-blog/2014/9/4/the-science-of-crawl-part-1-deduplication-of-web-content
It's a complex problem and they have some great detail. There is nothing I know of natively in Elastic that has intelligence to help you eliminate duplicates etc.
The strategy you need to adopt here would be to create a unique key per document. Taking checksum using sha1 or similar algorithm will do the job for getting the unique key. Make this the document ID so that only one page occurs at all point of time. Again use _create API to index if you dont want new duplicates to be indexed ( More efficient ) , and in case you want the new ones to be the document use normal indexing.
In case you need to modify the orginal document in case of disocvery of duplicate document , use upser.
I have explained a great deal of this in this blog.

Classify documents with tags

I have a huge amount of documents (mainly pdfs and doc's) I want to classify, so I can search over them according to certain tags. These tags could either be of my own (I put the tags to the document) or extracted from the text.
I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something even more simple.
Mahout might be overkill for your problem - but you can get a fairly quick, easy solution by using OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you desire. If they are really distinct, you can get away with a small sample size.
You can use the DocumentCategorizerME.train() static function to train a collection of documents, where each requires a category tag and the text block to train on. Then, you can initialize the DocumentCategorizerME with the trained model and begin classifying all the rest of your documents.
Once you do this, you can (I think) write the model to a file so you don't have to ever do that again.
This post on extracting keywords and classifying webpages is related and may be helpful. In your example it sounds like you can use tags in lieu of the keyword extraction piece (although you may want to use both in combination). Weka is easy to use, I would definitely recommend giving it a look.

Manually specifying how to build an index?

I'm looking into Thinking Sphinx for it's potential to solve an indexing problem. It looks like it has a very specific API for telling it what fields to index on a model. I don't like having this layer of abstraction in my way without being able to sidestep it. The thing is I don't trust Sphinx to be able to interpret my model properly as this model could have any conceivable property. Basically, I want to encode JSON in a RDBMS. In a way, I'm looking to make an RDBMS behave like MongoDB (RDBMSes have features I don't want to do without). If TS or some other index could be made to understand my models this could work. Is it possible to manually provide key/value pairs to TS?
"person.name.first" => "John", "person.name.last" => "Doe", "person.age" => 32,
"person.address" => "123 Main St.", "person.kids" => ["Ed", "Harry"]
Is there another indexing tool that could be used from Ruby to index JSON?
(By the way, I have explored a wide variety of NoSQL databases. I am trying to address a very specific set of requirements.)
As Matchu has pointed out in the comments, Sphinx usually interacts directly with the database. This is why Thinking Sphinx is built like it is.
However, Sphinx (but not Thinking Sphinx) can also accept XML data formats - so if you want to go down that path, feel free. You're going to have to understand the underlying Sphinx structure much more deeply than you would if using a normal relational database/ActiveRecord and Thinking Sphinx approach. Riddle may be useful for building a solution, but you'll still need to understand Sphinx itself first.
Basically, when you're specifying what you want to index--that is, when you want to build your own index--you're using the Map part of Map/Reduce. CouchDB supports exactly this. The only problem I ran into with Couch is that I want to query other document objects as the basis of my Map/Reduce since those documents would contain metadata about how I want to build my indexes. This goes against the grain of Map/Reduce however as you have to map a document in isolation with no external data. If you need external data it would instead be denormalized into your documents.

Lightweight fuzzy search library

Can you suggest some light weight fuzzy text search library?
What I want to do is to allow users to find correct data for search terms with typos.
I could use full-text search engines like Lucene, but I think it's an overkill.
Edit:
To make question more clear here is a main scenario for that library:
I have a large list of strings. I want to be able to search in this list (something like MSVS' intellisense) but it should be possible to filter this list by string which is not present in it but close enough to some string which is in the list.
Example:
Red
Green
Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
Main language for indexed data will be English.
I think that Lucene is to heavy for that task.
Update:
I found one product matching my requirements. It's ShuffleText.
Do you know any alternatives?
Lucene is very scalable—which means its good for little applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide what algorithm you'd like to use. With information retrieval, I use an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data? How often is the data updated?
I'm not sure how well Lucene is suited for fuzzy searching, the custom library would be better choice. For example, this search is done in Java and works pretty fast, but it is custom made for such task:
http://www.softcorporation.com/products/people/
Soundex is very 'English' in it's encoding - Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
You didn't specify your development platform, but if its PHP then suggest you look at the ZEND Lucene lubrary :
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
As it LAMP its far lighter than Lucene on Java, and can easily be extended for other filetypes, provided you can find a conversion library or cmd line converter - there are lots of OSS solutions around to do this.
Try Walnutil - based on Lucene API - integrated to SQL Server and Oracle DBs . You can create any type of index and then use it. For simple search you can use some methods from walnutilsoft, for more complicated search cases you can use Lucene API. See web based example where was used indexes created from Walnutil Tools. Also you can see some code example written on Java and C# which you can use it for creating different type of search.
This tools is free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend using PostgreSQL and its fuzzy string matching functions.
If you can use Ruby, I suggest looking into the amatch library.
#aku - links to working soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on that also has implementations listed at the bottom.
A powerful, lightweight solution is sphinx.
It's smaller then Lucene and it supports disambiguation.
It's written in c++, it's fast, battle-tested, has libraries for every env and it's used by large companies, like craigslists.org

Resources