Crawlers other than Nutch that work with Elasticsearch - elasticsearch

I'm trying to get some suggestions as I setup my data system. I'd like to setup a system for web crawling. It'll crawl probably a few hundred/thousand sites on a regular basis.
I'm aware of Nutch and have used Nutch, however I'd like to know if others know of a better crawler than Nutch.
I'm also using Elasticsearch as the indexer and its quite hard to get Nutch to work with newer versions of ES.

You can take a look at StormCrawler is based on Apache Storm and is not only a full-featured crawler but also has a focus on Near Real Time crawling. ES is usually very updated, at the moment of this writing, supports ES v6.1.1 (https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/pom.xml#L20) so this could work you. Keep in mind that is a different approach & technologie than Nutch, although it uses some of the ideas behind Apache Nutch.
Also, in https://github.com/BruceDone/awesome-crawler you can find a list of a lot of crawlers written in a lot of different languages.

Related

Local indexing of rich text files

I am trying to create a local index for my notes which comprises mainly of markdown files, text files, codes in python, javascript and dart.
I came across Solr and Elasticsearch.
But the main differences are focused around online use and distributedness.
Which can be a better choice if i need a good integrarion with javascript through electronjs?
Keeping in mind that the files are on local storage and there is not much focus on distributedness but on integration with javascript frontend and efficiency on local system.
Elasticsearch is more popular among newer developers due to its ease of use. But if you are already used to working with Solr, stay with it because there is no specific advantage of migrating to Elasticsearch.
I believe for your use case either of them would work.
However, If you need it to handle analytical queries in addition to searching text, Elasticsearch is the better choice
In terms of popularity, a larger community, documentations I would say elasticsearch is the winner, You can look at the below google trends
You can use the solr along with Apache Tika.
Apache Tika help in extracting the content/Text of different file system.
Using the above the you can index the metadata of the files and content of the files to the Apache solr.
You get admin tool for the analysis of the index and the fields to determine if you able to achieve the desired result.

Nutch and Elasticsearch Integration

I'd like to know which versions of Nutch and Elasticsearch work well together to create a vertical search application (on AWS)?
If I plan on starting with 500 sites to crawl and increase from there, what are the best versions to use together.
I have Nutch 1.10 and ES 1.5 working together on my local machine for dev and testing purposes but I know as my data gets bigger (more sites crawled) this won't be feasible.
I'd like to use AWS EMR and store the crawled data on S3.
Ok, so after much searching, reading and watching some videos... its pretty clear that Nutch 2.x (2.3) is a good choice. It seems to be better suited going forward and will work with ES.
-HTH anyone else facing similar situation

How to load Kibana3 dashboards into Kibana4?

I have recently installed Kibana4 but I am beginning to understand that dashboards are designed differently from Kibana3 i.e., to embed multiple visualizations which are designed individually into every dashboard. I already have a lot of dashboards designed in Kibana3 so I would like to know if there is a way to load them to kibana4 instead of creating everything from scratch.
To the best I know, there is no way to do that. Not just the formats, but the queries sent to ES backend are quite different. Kibana 3 used to use facets a lot for segmentation which is a deprecated feature and Kibana4 got rid of that.

What is a good choice for Fulltext indexing when developing a OSX application?

Hy,
I'm implementing an IMAP client as a Mac OSX application using MacRuby.
For the sake of offline availability, I wanted to allow fulltext indexing and attribute based indexing of all messages. Attributes include common E-Mail stuff like from:, to:, etc...
This would allow for advanced results sprinkled with faceting, analytic calculations and such.
Now I'm unsure about the choices and good practices when it comes to integrating such a search feature. I have a strong web development background, therefore my intuitive action would be to setup a Solr server and start feeding it with data. This might just work theoretically, as I could write an Agent that manages the solr instance for my application in the background. But to me, this approach seems like an infrastructure hassle.
On the other side, I've read about people using the FTS3 functionality from SQLite. This approach is easily accessible by CoreData. I haven't used SQLite's FTS3 but I don't think it is as powerful as Solr can be.
What is your weapon of choice for a use case like mine?
I'm mainly interested in solutions that are actually in use by Objective-C/Cocoa/MacRuby developers.
In you're going to develop the app with Ruby give a try to picky. It is very simple to use.
There is an Objective-C Lucene port
http://svn.gna.org/viewcvs/etoile/trunk/Etoile/Frameworks/LuceneKit/
I have not used it, but in your situation, I'd at least check it out. In my experience, SQL based full text search can't compete with Lucene, but haven't tried SQLite for this.
EDIT: just noticed the ruby tag -- this started out as port of Lucene
https://github.com/dbalmain/ferret

Which schemaless datastores provide good performance?

I've recently written a web app that uses couchdb. I like couchdb and it suited the app - which has a lot of dynamic behaviour and simply pulls JSON directly from couchdb. Being able to upload images via a browser is nice and it's a snap to do tweaks to document data. The replication also has made deployment a breeze as the app is a couchapp, and all that's required to deploy is a replicate to the production server.
However for a new app I'm thinking off (think blog type thingy), I want good performance and it's one area I think couchdb is not strong in. The app will be predominantly read oriented (I'm estimating 90% reads to 10% writes).
Which datastores provide the best performance in a single server scenario? I'd be very interested to hear people's experiences in this...
I think MongoDB is beginning to look like the front runner performance wise for schemaless data stores.
We're currently in the processes of evaluating this for storing binary objects that can range from 10Kb to 50Mb and I've been very impressed with it's performance even on modest hardware.
If it is primarily read performance you are worried about why not just put a varnish proxy in front of couchdb? I use a couple of custom configurations in varnish to tell it not to actually query couchdb for cached objects despite couchdb specifying must-validate, then have a script with an active HTTP GET on _changes that uses the data from _changes in order to explicitly purge changed entries from varnish.
As a plus varnish lets you do URL rewriting, which I need. Most of the other solutions for it involve running something like apache or ngnix just to rewrite URLs for couchdb.

Resources