Nutch and Elasticsearch Integration - elasticsearch

I'd like to know which versions of Nutch and Elasticsearch work well together to create a vertical search application (on AWS)?
If I plan on starting with 500 sites to crawl and increasing from there, what are the best versions to use together?
I have Nutch 1.10 and ES 1.5 working together on my local machine for dev and testing purposes but I know as my data gets bigger (more sites crawled) this won't be feasible.
I'd like to use AWS EMR and store the crawled data on S3.

OK, so after much searching, reading, and watching some videos, it's pretty clear that Nutch 2.x (2.3) is a good choice. It seems to be better suited going forward and will work with ES.
HTH anyone else facing a similar situation.

Related

Local indexing of rich text files

I am trying to create a local index for my notes, which consist mainly of markdown files, text files, and code in Python, JavaScript and Dart.
I came across Solr and Elasticsearch.
But the differences I found between them focus mainly on online use and distributed deployment.
Which would be the better choice if I need good integration with JavaScript through Electron?
Keep in mind that the files are on local storage, and the focus is not on distribution but on integration with a JavaScript frontend and efficiency on a local system.
Elasticsearch is more popular among newer developers due to its ease of use. But if you are already used to working with Solr, stay with it, because there is no specific advantage in migrating to Elasticsearch.
I believe for your use case either of them would work.
However, if you need it to handle analytical queries in addition to text search, Elasticsearch is the better choice.
In terms of popularity, community size, and documentation, I would say Elasticsearch is the winner. You can compare the two on Google Trends.
You can use Solr along with Apache Tika.
Apache Tika helps extract the content/text of different file types.
Using the two together, you can index both the metadata and the content of the files into Apache Solr.
You also get an admin tool for analyzing the index and its fields, to determine whether you are able to achieve the desired result.
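Whichever engine is chosen, the first step is the same: turn each note file into a document with some metadata and its text. A minimal sketch of that step, using only the Python standard library and assuming plain-text notes (no Tika needed for these formats). The field names `id`, `extension`, `size_bytes` and `content` are made up for illustration, and the actual push to Solr or Elasticsearch is deliberately left out:

```python
import os

# Extensions taken from the question: markdown, text, Python, JavaScript, Dart.
NOTE_EXTENSIONS = {".md", ".txt", ".py", ".js", ".dart"}


def collect_documents(root):
    """Walk `root` and return one dict per note file, ready to hand to an indexer."""
    docs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            ext = os.path.splitext(name)[1].lower()
            if ext not in NOTE_EXTENSIONS:
                continue  # skip images, binaries and anything else that is not a note
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                content = fh.read()
            docs.append({
                "id": os.path.relpath(path, root),   # hypothetical field names
                "extension": ext,
                "size_bytes": os.path.getsize(path),
                "content": content,
            })
    return docs
```

Each returned dict could then be sent to the search engine's indexing endpoint from the Electron side or from a small helper process.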

Crawlers other than Nutch that work with Elasticsearch

I'm trying to get some suggestions as I set up my data system. I'd like to set up a system for web crawling. It'll probably crawl a few hundred to a few thousand sites on a regular basis.
I'm aware of Nutch and have used it, but I'd like to know if others know of a better crawler than Nutch.
I'm also using Elasticsearch as the indexer, and it's quite hard to get Nutch to work with newer versions of ES.
You can take a look at StormCrawler, which is based on Apache Storm and is not only a full-featured crawler but also has a focus on near-real-time crawling. Its ES support is usually kept up to date; at the time of this writing it supports ES v6.1.1 (https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/pom.xml#L20), so this could work for you. Keep in mind that it is a different approach and technology than Nutch, although it uses some of the ideas behind Apache Nutch.
Also, at https://github.com/BruceDone/awesome-crawler you can find a list of many crawlers written in many different languages.

Cassandra and advanced queries: Spark, ElasticSearch, Solr

Ok, so, I'm developing an app and I'm using Cassandra as the database.
Everything is going well so far, but now I need to do a query using the LIKE clause.
I know Cassandra doesn't support that, so after looking for a workaround I was thinking of maintaining the single table that I need to query with LIKE in another database, other than Cassandra. I was even considering a relational database, even though there wouldn't be any relations.
Then I started looking to see if this is really the right approach, and came across things like Spark, Solr and ElasticSearch.
Just to make it clear: I have little to no knowledge about those frameworks. Really. I only have heard about them and that's all.
So, I'm not here to ask you guys 'hey, how to do that using this framework?'. I just want to know, before I dig into any of those: Would any of those satisfy my needs? - Since I have no idea exactly how they work, and what exactly they are for.
If that is the case, then I'll study the framework properly. I just don't want to spend the time only to find out it has nothing to do with my problem.
Thanks!
Both Elasticsearch and Solr fit your needs. They use the Lucene library to perform inverted indexing and much more. DataStax Enterprise (a commercial distribution of Cassandra) offers this solution by integrating Solr natively. One more solution (a little different, but working) is to integrate Infinispan, which offers both integration with a Cassandra repository and inverted indexing ...
HTH,
Carlo
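To give a feel for what inverted indexing buys you compared with a LIKE scan, here is a toy sketch in plain Python. It is not how Lucene is implemented; the tokenizer, the function names and the sample rows are all made up for illustration. The idea is only that each term points at the rows containing it, so a text lookup starts from the term instead of scanning every row:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(rows):
    """Map each term to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for term in tokenize(text):
            index[term].add(row_id)
    return index


def search_prefix(index, prefix):
    """Rough analogue of LIKE 'prefix%', applied per term rather than per column value."""
    prefix = prefix.lower()
    hits = set()
    # A real engine would walk a sorted term dictionary; a linear pass keeps the toy short.
    for term, row_ids in index.items():
        if term.startswith(prefix):
            hits |= row_ids
    return hits


rows = {
    1: "Cassandra data modeling",
    2: "Searching with Elasticsearch",
    3: "Elastic scaling on Cassandra",
}
index = build_inverted_index(rows)
print(sorted(search_prefix(index, "elastic")))  # [2, 3]
print(sorted(index["cassandra"]))               # [1, 3]
```

In a setup like the one described above, Cassandra would stay the system of record and the search engine would hold this kind of index over the one table that needs text queries.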

Using elasticsearch as central data repository

We are currently using elasticsearch to index and perform searches on about 10M documents. It works fine and we are happy with its performance. My colleague who initiated the use of elasticsearch is convinced that it can be used as the central data repository and other data systems (e.g. SQL Server, Hadoop/Hive) can have data pushed to them. I didn't have any arguments against it because my knowledge of both is too limited. However, I am concerned.
I do know that data in elasticsearch is stored in a manner that is efficient for text searching. Hadoop stores data just as a file system would, but in a manner that makes it efficient to scale and replicate blocks over multiple data nodes. Therefore, in my mind it seems more beneficial to use Hadoop (as it is more agnostic w.r.t. its view on data) as a central data repository, and then push data from Hadoop to SQL, elasticsearch, etc.
I've read a few articles on Hadoop and elasticsearch use cases and it seems conventional to use Hadoop as the central data repository. However, I can't find anything that would suggest that elasticsearch wouldn't be a decent alternative.
Please Help!
As is the case with all database deployments, it really depends on your specific application.
Elasticsearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to basically function just like a schema-less JSON datastore that can be accessed using both search-specific methods and regular database CRUD-like commands.
Despite all the advantages that Elasticsearch brings, there are still some main disadvantages:
Security - Elasticsearch does not provide any authentication or access control functionality. This has been supported since Shield was introduced.
Transactions - There is no support for transactions or processing on data manipulation. Data manipulation can now be handled with Logstash.
Durability - ES is distributed and fairly stable but backups and durability are not as high priority as in other data stores.
Maturity of tools - ES is still relatively new and has not had time to develop mature client libraries and 3rd party tools which can make development much harder. We can consider that it's quite mature now, with a variety of connectors and tools around it like Kibana.
Large computations - It's still not suited for these. Commands for searching data are not suited to "large" scans of data and advanced computation on the db side.
Data Availability - ES makes data available in "near real-time", which may require additional considerations in your application (e.g. on a comments page where a user adds a new comment, refreshing the page might not actually show the new post because the index is still updating).
If you can deal with these issues then there's certainly no reason why you can't use Elasticsearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data but again this depends on your specific use case.
As always, weigh the benefits, do some experimentation and see what works best for you.
DISCLAIMER: This answer was written a while ago for the Elasticsearch 1.x series. These criticisms still partly stand with the 2.x series. But Elastic is working on them, as the 2.x series comes with more mature tools, APIs and plugins, for example Shield on the security side, or shippers like Logstash or Beats, etc.
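To make the "near real-time" point from the list above concrete, here is a toy model in plain Python. It is not Elasticsearch code; the class and its methods are invented for illustration. It only mimics the behaviour described: a write is accepted immediately but becomes visible to search only after a refresh, which Elasticsearch performs periodically on its own.

```python
class ToyNearRealTimeIndex:
    """Toy model: writes land in a buffer and become searchable only after refresh()."""

    def __init__(self):
        self._buffer = []      # accepted writes, not yet searchable
        self._searchable = []  # documents visible to search()

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # Move everything buffered so far into the searchable set.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        return [d for d in self._searchable if word in d["text"]]


comments = ToyNearRealTimeIndex()
comments.index({"id": 1, "text": "first comment"})
before = comments.search("comment")  # empty: the write has not been refreshed yet
comments.refresh()
after = comments.search("comment")   # now contains the new comment
print(len(before), len(after))       # 0 1
```

An application that reloads the comments page between `index` and `refresh` sees the `before` result, which is exactly the surprise described in the Data Availability item.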
I'd highly discourage most users from using elasticsearch as your primary datastore. It will work great until your cluster melts down due to a network partition. Even settings such as minimum_master_nodes that the ES pros always set won't save you. See this excellent analysis by Aphyr with his Call Me Maybe series:
http://aphyr.com/posts/317-call-me-maybe-elasticsearch
eliasah is right that it depends on your use case, but if your data (and job) is important to you, stay away.
Keep the golden record of your data stored in something really focused on persistence, and sync your data out to search from there. It adds extra complexity and resources, but will result in a better night's rest :)
There are plenty of ways to go about this and if elasticsearch does everything you need, you can look into Kafka for persisting all the events going into a cluster which would allow replaying if things go wrong. I like this approach as it provides an async ingestion pipeline into elasticsearch that also does the persistence.
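A minimal sketch of that replay idea, with an in-memory list standing in for a Kafka topic and a dict standing in for the Elasticsearch index. Nothing here uses a real Kafka or Elasticsearch client; the `EventLog` class and the event shape (`op`, `id`, `doc`) are assumptions made for illustration:

```python
class EventLog:
    """In-memory stand-in for a durable, append-only log such as a Kafka topic."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset=0):
        return list(self._events[offset:])


def apply_event(index, event):
    """Apply one event to the search index (a dict keyed by document id here)."""
    if event["op"] == "upsert":
        index[event["id"]] = event["doc"]
    elif event["op"] == "delete":
        index.pop(event["id"], None)


def rebuild_index(log):
    """Replay the whole log into a fresh index, e.g. after losing the cluster."""
    index = {}
    for event in log.read_from(0):
        apply_event(index, event)
    return index


log = EventLog()
log.append({"op": "upsert", "id": "a", "doc": {"title": "first"}})
log.append({"op": "upsert", "id": "b", "doc": {"title": "second"}})
log.append({"op": "delete", "id": "a"})

# Normal operation: a consumer applies events to the live index as they arrive.
live_index = {}
for event in log.read_from(0):
    apply_event(live_index, event)

# Recovery: the same log rebuilds an identical index from scratch.
recovered = rebuild_index(log)
print(recovered == live_index)  # True
```

Because the log is the durable record, the search index becomes a derived view that can be thrown away and rebuilt.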

Elasticsearch deployment environment setup

We are working on setting up our elasticsearch backend for a production environment. Up until a few weeks ago we were using Solr, but we decided to switch to Elasticsearch for a few reasons, the biggest being the distributed nature of the backend.
With that said, we've been looking for some documentation and best practices on deploying elasticsearch using Amazon's services.
For the moment, we are considering using an extra-large box and then scaling out from there, but we aren't sure that is the best approach. For example, it may be better to have three mediums than one extra-large.
We intend to index around 100K to 150K documents per day up to around ten million docs.
The question is, can anyone provide a general environment / deployment diagram for elasticsearch or best practices in general?
There are some docs for elasticsearch that talk about EC2 deployment. There's an autodiscovery plugin based on EC2 tags, security groups, or whatever you like. You can also choose S3 for persistence, although that may not really be necessary.
I'd advise launching it in a VPC so you can have permanent internal IPs. In regular EC2 your internal IPs will change with every reboot, even if you're using Elastic IPs.
