Lucene query from Hadoop PIG jobs - hadoop

I have couple of thousands of customer names, alternative names, business names etc details indexed in Lucene indexes (indexes are not stored in HDFS).
I have massive amount (>100M) of person data in HDFS and I want to scan person data with Lucene indexes, I am currently using PIG for data processing from HDFS.
I am trying to find if it is possible to run PIG job which extracts data and in-parallel perform queries to Lucene indexes (may be by using custom written UDF), I am not able to think how Lucene local indexes are loaded and shared within PIG jobs (after Lucene query I need matched document IDs if match is found).
Is it possible using PIG ? or I need to write custom map-reduce jobs for this ? Or any other suggestions ?
Thanks.

You definitely need UDFs for that - elephant-bird's lucene loader is a good starting point.
Check it out at https://github.com/kevinweil/elephant-bird/tree/master/pig

Related

Solr HBase search engine

I need to use SolrCloud as the search engine on top of HBase and HDFS for searching a very large num of documents.
Currently these docs are in different data sources. I am getting confused whether Solr should search, index and store these docs within itself or Solr should just be used for indexing and docs along with their metadata of the docs should reside in HBAse/HDFS layer.
I have tried searching how the Solr HBase integration works best (meaning what should be done at the Solr level and what at the Hadoop level) but in vain. Anyone has done this kind of Big Data search earlier and can give some pointers? Thanks
Solr provides fast search via its indexes. Solr uses inverted indexes for this. So, you index documents to solr, it creates the indexes. Based on how you have defined the schema.xml, solr decides how the indexes has to be created. The indexes and the field values are stored in HDFS (based on your config in solrconfig.xml)
With respect to Hbase, you can directly query run you query on hbase without having to use Solr. SolrBase is an SOLR and Hbase integration available. Also have a look at liliy
The good design followed is search for things in solr, get the id of the records quickly, and then if needed, fetch the entire record from Hbase. You need to make sure that entire data is there in hbase, and only sufficient data is indexed. Needless to say that both solr and hbase should be in sync. One ready made framework, is NGDATA/hbase indexer here.
Solr works wonders to get the counts, grouping counts, stats. So once you get those numbers and their id's, Hbase can take over. once u have row key in hbase(id), you get low latency search results, that suits well with web applications too

Indexing logs with es-hadoop

I am new to elasticsearch and want to index my website logs which are stored on HDFS for fast querying.
I have a well structured pipeline which runs a script every 20 minutes to ingest the data into HDFS.
I want to integrate elasticsearch with it, so that it also indexes these logs based on particular field(s) and thereby giving faster query results using spark SQL.
So, my question is, can I index my data based on particular field(s) only?
Also, my logs are saved in avro file format. Does es provides a way to directly index avro serialized data or do I need to convert it into some other format?
Thank you in advance.
I would suggest you to look at Elasticsearch, Logstash and Kibana stack that should be good enough to full fill your requirement. Putting it on HDFS and then using ES would be additional overhead.
Instead, you can use Logstash to pump data into ES, index on whatever fields you wish to query and build easy dashboards in less than 10 minutes of exercise. Take a look at this tutorial for better step-by-step guide.
http://hadooptutorials.co.in/tutorials/elasticsearch/log-analytics-using-elasticsearch-logstash-kibana.html

Elasticsearch and Hive work together

I see that Hive and Elasticsearch are almost equivalent except that Elasticsearch supports near real time queries. Moreover, Elasticsearch can run independently to store and analyze data. So why people use both Hive and Elasticsearch on Hadoop ?
Hive and Elasticsearch are two really different tools.
Hive is a SQL to Hadoop Java translator to interact with virtually any datasource using SQL (including elasticsearch), using SerDe's. Hive can also store data using HDFS. Hive is really good at batch processing.
Elasticsearch is a distributed faceted search engine, it is very good to quickly retrieve data in millions of documents. It can also be used to make some simple calculations using facets.
Hive and ES are complementary, people use Hive to process data, and ES to deliver data / insights.

How can I copy hadoop data to SOLR

I've a SOLR search which uses lucene index as a backend.
I also have some data in Hadoop I would like to use.
How do I copy this data into SOLR ??
Upon googling the only likns I can find tell me how to use use an HDFS index instead of a local index, in SOLR.
I don't want to read the index directly from hadoop, I want to copy them to SOLR and read it from there.
How do I copy? And it would be great if there is some incremental copy mechanism.
If you have a standalone Solr instance, then you could face some scaling issues, depending on the volume of data.
I am assuming high volume given you are using Hadoop/HDFS. In which case, you might need to look at SolrCloud.
As for reading from hdfs, here is a tutorial from LucidImagination, that addresses this issue, and recommends the use of Behemoth
You might also want to look at Katta project, that claims to integrate with hadoop and provide near real-time read access of large datasets . The architecture is illustrated here
EDIT 1
Solr has an open ticket for this. Support for HDFS is scheduled for Solr 4.9. You can apply the patch if you feel like it.
You cannot just copy custom data to Solr, you need to index* it. You data may have any type and format (free text, XML, JSON or even binary data). To use it with Solr, you need to create documents (flat maps with key/value pairs as fields) and add them to Solr. Take a look at this simple curl-based example.
Note, that reading data from HDFS is a different question. For Solr, it doesn't matter where you are reading data from as long as you provide it with documents.
Storing index on local disk or in HDFS is also a different question. If you expect your index to be really large, you can configure Solr to use HDFS. Otherwise you can use default properties and use local disk.
* - "Indexing" is a common term for adding documents to Solr, but in fact adding documents to Solr internal storage and indexing (making fields searchable) are 2 distinct things and can be configured separately.

LUCENE and Hadoop

i am using lucene for providing indexing and searching on text file.can i use HDFS for storing index file.
You interchange tasks: instead of thinking where to use Hadoop, first think what you need to implement your project. And if you see that you need Hadoop, it will become obvious where and how to use it.
One tip. Most probably you don't need neither Hadoop, nor even Lucene itself: Solr - search server created on top of Lucene - now has distributed setup, which is specifically designed for indexing and searching; Nutch may be used as front-end for Solr to crawl the web; and Tika may help you to parse all types of offline files.
Lucene comes into picture after all your data is ready in form of lucene documents ( lucene cache ).
Looks like you know Lucene already. The purpose of Hadoop is to reduce a big task into small chunks. I think first usage of Hadoop can be to gather data. Each hadoop node can keep collecting data; and create lucene documents

Resources