Local indexing of rich text files - elasticsearch

I am trying to create a local index for my notes, which consist mainly of markdown files, text files, and code in Python, JavaScript and Dart.
I came across Solr and Elasticsearch.
But the main differences discussed between them revolve around online use and distributed deployment.
Which would be the better choice if I need good integration with JavaScript through ElectronJS?
Keep in mind that the files are on local storage, and the focus is not on distribution but on integration with a JavaScript frontend and efficiency on a local system.

Elasticsearch is more popular among newer developers due to its ease of use. But if you are already used to working with Solr, stay with it, because there is no specific advantage in migrating to Elasticsearch.
I believe either of them would work for your use case.
However, if you need it to handle analytical queries in addition to searching text, Elasticsearch is the better choice.
In terms of popularity, community size and documentation, I would say Elasticsearch is the winner; a quick comparison on Google Trends bears this out.
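If you pick Elasticsearch, the official @elastic/elasticsearch Node.js client works the same way inside an Electron main process as in any Node script. Below is a minimal sketch, assuming a local single-node instance at http://localhost:9200 and an index called "notes" (both are placeholders to adapt), using the v8-style client API:

```typescript
// Minimal sketch: index local markdown/text/code notes into a local
// Elasticsearch node from an Electron main process (or any Node.js script).
import { Client } from '@elastic/elasticsearch';
import { promises as fs } from 'fs';
import * as path from 'path';

const client = new Client({ node: 'http://localhost:9200' });

async function indexNotes(dir: string): Promise<void> {
  for (const name of await fs.readdir(dir)) {
    if (!/\.(md|txt|py|js|dart)$/.test(name)) continue;
    const file = path.join(dir, name);
    const body = await fs.readFile(file, 'utf8');
    // Each note becomes one document; the file path doubles as the ID,
    // so re-indexing a changed file simply overwrites the old version.
    await client.index({
      index: 'notes',
      id: file,
      document: { path: file, content: body, indexedAt: new Date() },
    });
  }
  await client.indices.refresh({ index: 'notes' });
}

async function searchNotes(query: string) {
  const res = await client.search({
    index: 'notes',
    query: { match: { content: query } },
  });
  return res.hits.hits;
}

indexNotes('./my-notes')
  .then(() => searchNotes('electron'))
  .then((hits) => console.log(hits))
  .catch(console.error);
```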

You can use Solr along with Apache Tika.
Apache Tika helps extract the content/text of many different file types.
Using these, you can index both the metadata and the content of the files into Apache Solr.
You also get the Solr admin UI for analysing the index and its fields, so you can check whether you are achieving the desired result.
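As a rough sketch of that setup (assuming Node 18+ for the built-in fetch/FormData, a local Solr at http://localhost:8983, and a core named "notes"; the field names are schema-dependent placeholders), you can POST a file to Solr's /update/extract handler, which runs it through Tika and indexes the extracted text and metadata:

```typescript
// Minimal sketch: push a local file through Solr's "extract" handler
// (Solr Cell, which uses Apache Tika under the hood).
import { promises as fs } from 'fs';
import * as path from 'path';

async function extractAndIndex(file: string): Promise<void> {
  const data = await fs.readFile(file);
  const form = new FormData();
  form.append('file', new Blob([data]), path.basename(file));

  const params = new URLSearchParams({
    'literal.id': file,            // use the path as the document ID
    'literal.path_s': file,        // keep the path as extra metadata
    'fmap.content': 'content_txt', // field receiving the extracted text (schema-dependent)
    commit: 'true',
  });

  const res = await fetch(
    `http://localhost:8983/solr/notes/update/extract?${params}`,
    { method: 'POST', body: form },
  );
  if (!res.ok) throw new Error(`Solr returned ${res.status}`);
}

extractAndIndex('./my-notes/todo.md').catch(console.error);
```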

Related

Is using Elastic Search as authoritative datastore for applications advisable?

I'm new to using Elasticsearch, and I'm trying to find a datastore for our application to which we can also add a front end for analytics, in this case Kibana. I'm planning to use them as the datastore for dr/cr transactions on our billing system.
Most use cases I read about are geared towards data analytics and search. I don't see a use case where it is used as a regular datastore for an application, so I'm worried I might be applying it to the wrong use case.
I was hoping someone could add their insights on this, i.e. why or why not to use Elasticsearch as the authoritative/primary datastore for an application.
You should read the official Elasticsearch blog post on resiliency, where they clearly state that a database must be robust and should not stop working unless you tell it to.
From the robustness section of that post:
A database should be robust, especially if it is your authoritative
system of record. Ideally, a costly query should be possible to
cancel, and you certainly don't want the database to stop working
unless you tell it to.
Unfortunately, Elasticsearch (and the components it's made of) does
not currently handle OutOfMemory-errors very well. We cover this in
more depth in Elasticsearch in Production, OutOfMemory-Caused Crashes.
It is very important to provide Elasticsearch with enough memory and
be careful before running searches with unknown memory requirements on
a production cluster.
In short, you shouldn't use Elasticsearch as a primary data store if you can't afford to lose the data.

Crawlers other than Nutch that work with Elasticsearch

I'm trying to get some suggestions as I setup my data system. I'd like to setup a system for web crawling. It'll crawl probably a few hundred/thousand sites on a regular basis.
I'm aware of Nutch and have used Nutch, however I'd like to know if others know of a better crawler than Nutch.
I'm also using Elasticsearch as the indexer, and it's quite hard to get Nutch to work with newer versions of ES.
You can take a look at StormCrawler. It is based on Apache Storm and is not only a full-featured crawler but also focuses on near-real-time crawling. Its ES support is usually kept very up to date; at the time of this writing it supports ES v6.1.1 (https://github.com/DigitalPebble/storm-crawler/blob/master/external/elasticsearch/pom.xml#L20), so this could work for you. Keep in mind that it is a different approach and technology than Nutch, although it uses some of the ideas behind Apache Nutch.
Also, in https://github.com/BruceDone/awesome-crawler you can find a list of a lot of crawlers written in a lot of different languages.

How to load Kibana3 dashboards into Kibana4?

I have recently installed Kibana 4, but I am beginning to understand that dashboards are designed differently from Kibana 3, i.e. multiple visualizations that are designed individually are embedded into each dashboard. I already have a lot of dashboards designed in Kibana 3, so I would like to know if there is a way to load them into Kibana 4 instead of creating everything from scratch.
To the best of my knowledge, there is no way to do that. Not just the formats, but the queries sent to the ES backend are quite different. Kibana 3 relied heavily on facets for segmentation, a deprecated feature that Kibana 4 got rid of.

Using elasticsearch as central data repository

We are currently using elasticsearch to index and perform searches on about 10M documents. It works fine and we are happy with its performance. My colleague who initiated the use of elasticsearch is convinced that it can be used as the central data repository and other data systems (e.g. SQL Server, Hadoop/Hive) can have data pushed to them. I didn't have any arguments against it because my knowledge of both is too limited. However, I am concerned.
I do know that data in elasticsearch is stored in a manner that is efficient for text searching. Hadoop stores data just as a file system would, but in a manner that is efficient for scaling and replicating blocks over multiple data nodes. Therefore, in my mind it seems more beneficial to use Hadoop (as it is more agnostic in its view of the data) as the central data repository, and then push data from Hadoop to SQL Server, elasticsearch, etc.
I've read a few articles on Hadoop and elasticsearch use cases and it seems conventional to use Hadoop as the central data repository. However, I can't find anything that would suggest that elasticsearch wouldn't be a decent alternative.
Please Help!
As is the case with all database deployments, it really depends on your specific application.
Elasticsearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to basically function just like a schema-less JSON datastore that can be accessed using both search-specific methods and regular database CRUD-like commands.
Despite all the advantages Elasticsearch brings, there are still some notable disadvantages:
Security - Elasticsearch does not provide any authentication or access control functionality out of the box. It has been supported since Shield was introduced.
Transactions - There is no support for transactions or processing on data manipulation. (Data manipulation pipelines are now commonly handled with Logstash.)
Durability - ES is distributed and fairly stable, but backups and durability are not as high a priority as in other data stores.
Maturity of tools - ES is still relatively new and has not had time to develop mature client libraries and 3rd-party tools, which can make development much harder. (It can be considered quite mature now, with a variety of connectors and tools around it, like Kibana.)
Large computations - Commands for searching data are not suited to "large" scans of data or advanced computation on the database side.
Data Availability - ES makes data available in "near real-time", which may require additional considerations in your application (e.g. on a comments page where a user adds a new comment, refreshing the page might not actually show the new post because the index is still updating).
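To make that last point concrete: if a page has to show a document immediately after writing it, the write can wait for the next refresh. A minimal sketch using the modern official JavaScript client (a v8-style API that postdates this answer); the "comments" index and its fields are made-up placeholders:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function addCommentAndSearch(): Promise<void> {
  // Without "refresh", this write may not be visible to a search issued
  // immediately afterwards, because the segment has not been refreshed yet.
  await client.index({
    index: 'comments',
    document: { postId: 42, text: 'Nice article!', createdAt: new Date() },
    refresh: 'wait_for', // block until the document is searchable
  });

  // The follow-up query on the comments page now includes the new comment.
  const res = await client.search({
    index: 'comments',
    query: { term: { postId: 42 } },
  });
  console.log(res.hits.hits.length);
}

addCommentAndSearch().catch(console.error);
```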
If you can deal with these issues then there's certainly no reason why you can't use Elasticsearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data but again this depends on your specific use case.
As always, weigh the benefits, do some experimentation and see what works best for you.
DISCLAIMER: This answer was written a while ago for the Elasticsearch 1.x series. These criticisms still largely stand with the 2.x series, but Elastic is working on them: the 2.x series comes with more mature tools, APIs and plugins, for example Shield on the security side, or data shippers like Logstash and Beats.
I'd highly discourage most users from using Elasticsearch as their primary datastore. It will work great until your cluster melts down due to a network partition. Even settings such as minimum_master_nodes, which the ES pros always tell you to set, won't save you. See this excellent analysis by Aphyr in his Call Me Maybe series:
http://aphyr.com/posts/317-call-me-maybe-elasticsearch
eliasah is right: it depends on your use case, but if your data (and job) is important to you, stay away.
Keep the golden record of your data in something really focused on persistence, and sync your data out to search from there. It adds extra complexity and resources, but will result in a better night's rest :)
There are plenty of ways to go about this. If Elasticsearch does everything you need, you can look into Kafka for persisting all the events going into the cluster, which would allow replaying them if things go wrong. I like this approach because it provides an async ingestion pipeline into Elasticsearch that also handles the persistence.
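As a rough sketch of that pattern (using kafkajs and the official Elasticsearch client purely as examples; the broker address, topic and index names are placeholders): the application appends every event to Kafka first, and a separate consumer replays the topic into Elasticsearch, so the search cluster can always be rebuilt from the log.

```typescript
import { Kafka } from 'kafkajs';
import { Client } from '@elastic/elasticsearch';

const kafka = new Kafka({ clientId: 'ingest', brokers: ['localhost:9092'] });
const es = new Client({ node: 'http://localhost:9200' });

// Producer side: the application only writes to Kafka, which is the
// durable, replayable record of every event.
async function publishEvent(event: object): Promise<void> {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'events',
    messages: [{ value: JSON.stringify(event) }],
  });
  await producer.disconnect();
}

// Consumer side: a separate process replays the topic into Elasticsearch.
// If the ES cluster is lost, re-run the consumer from the beginning.
async function indexEvents(): Promise<void> {
  const consumer = kafka.consumer({ groupId: 'es-indexer' });
  await consumer.connect();
  await consumer.subscribe({ topic: 'events', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const doc = JSON.parse(message.value!.toString());
      await es.index({ index: 'events', document: doc });
    },
  });
}
```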

Design Question for Notification System

The original post was posted at https://stackoverflow.com/questions/6007097/design-question-for-notification-system
Here is more clarification of the problem: the purpose of the notification system is to notify users (via email for now) when content of the site has changed or been updated, or a new posting has been made. It can be treated as a system where people define a rule or keyword for a 3rd-party site, and the notification system goes out, crawls that site and builds inverted search indexes. Then a new link or document shows up for the user-defined keyword or rule (more explanation below regarding the use case).
As a clarified use case: suppose I am a Craigslist user looking for a used vehicle. I define a rule: "Honda Accord", year 1996, and a price range from $2000 to $3000.
For the above use case to work, what is the best approach, and how can I leverage open-source technology such as Apache Lucene, Apache Solr, Apache Nutch and Apache Hadoop to solve it?
You can think of it as building a search engine with a rule and keyword notification system. I just need some pointers and help on how to integrate these open-source packages to solve the use case.
Any help and pointers will be appreciated. The three important components we need are:
1) Web Crawler
2) Index Creator
3) Rule or keyword matcher
Any help will be greatly appreciated. I was referring to this wiki, which integrates Nutch and Solr for the above purpose: http://wiki.apache.org/nutch/RunningNutchAndSolr
Your question is a big one but I'll take a stab at it as I've designed and implemented systems like this before.
Ignoring user account management, your system will need to provide the means to:
retrieve new prospect data (web spider)
identify and extract pertinent results from prospect data (filtering)
collect, maintain and organize results (storage)
select results based on various metadata (querying)
format results for delivery to users (templating)
deliver formatted results to users (delivery)
If the scope of your project is small (say less than 100 sites requiring spidering per day), you could probably get along with one of the many open-source web spiders including wget, Nutch, WebSphinx, etc. You might need to provide instrumentation (custom software) for scheduling, monitoring and control. If your project scope is larger than this, you may need to "roll your own" spidering solution (custom software). Typically this would be designed as a distributed, parallel architecture.
For simple filtering, regular expressions would suffice but for more complex tasks requiring knowledge of HTML layout (extract the textual component of the fifth list element (<LI/>) of the fourth table on the page) you'd need to use an XHTML parser. However you proceed, you'll need to provide custom software to conduct filtering based on your users' needs.
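To make the two filtering styles concrete, here is a small sketch in TypeScript; the page snippet, the cheerio parser and the Craigslist-style rule values are all illustrative assumptions, not part of the original design.

```typescript
// Minimal sketch of the two filtering approaches described above, applied to
// a hypothetical listing page fetched by the spider.
import * as cheerio from 'cheerio';

const html = '<html>...fetched page body...</html>'; // retrieved by the spider

// 1) Simple filtering: a regular expression is enough to pull out prices.
const prices = [...html.matchAll(/\$\s?(\d{3,5})/g)].map((m) => Number(m[1]));
const inRange = prices.some((p) => p >= 2000 && p <= 3000);

// 2) Structure-aware filtering: use a parser when the rule depends on layout,
//    e.g. "the fifth <li> of the fourth table on the page".
const $ = cheerio.load(html);
const fifthItem = $('table').eq(3).find('li').eq(4).text().trim();

const matchesRule =
  inRange && /honda\s+accord/i.test(html) && /\b1996\b/.test(html);
console.log({ matchesRule, fifthItem });
```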
While any database technology can be used to store results extracted from retrieved documents, an engine optimized for text, like Apache Solr, will allow you to easily expand your search criteria as your needs dictate. Since Solr supports attaching metadata to each document and searching on it, it would be a good choice. You'll also need to provide custom software here to automate this step.
Once you've selected a list of candidate results from Solr, any scripting language could be used to template them into one or more emails and to inject them into your mail transport agent (MTA). This also requires custom software to automate the process (and, if required, to inject user-specific data into each message).
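A rough sketch of those last three steps (querying, templating, delivery), assuming a hypothetical Solr core named "listings" with equally hypothetical field names, and nodemailer talking to a local SMTP relay as the MTA hand-off:

```typescript
import nodemailer from 'nodemailer';

interface Listing { id: string; title_t: string; price_i: number; url_s: string; }

// Querying: ask Solr for documents matching the user's rule.
async function selectCandidates(): Promise<Listing[]> {
  const params = new URLSearchParams({
    q: 'title_t:"honda accord" AND year_i:1996 AND price_i:[2000 TO 3000]',
    rows: '20',
    wt: 'json',
  });
  const res = await fetch(`http://localhost:8983/solr/listings/select?${params}`);
  const json = await res.json();
  return json.response.docs as Listing[];
}

async function notify(to: string): Promise<void> {
  const docs = await selectCandidates();
  if (docs.length === 0) return;

  // Templating: turn the results into a plain-text digest.
  const body = docs
    .map((d) => `- ${d.title_t} ($${d.price_i}) ${d.url_s}`)
    .join('\n');

  // Delivery: hand the message to an MTA (here via SMTP with nodemailer).
  const transport = nodemailer.createTransport({ host: 'localhost', port: 25 });
  await transport.sendMail({
    from: 'alerts@example.com',
    to,
    subject: `${docs.length} new matches for "Honda Accord"`,
    text: body,
  });
}

notify('user@example.com').catch(console.error);
```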
You should probably also look at Google's Custom Search API before diving into crawling the web yourself. This way, Google can handle returning keyword-based search results, which you could then filter in your application based on your additional algorithms/rules etc., and make the whole thing work.
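If you go that route, the Custom Search JSON API is just an HTTP endpoint; here is a rough sketch (the API key, search engine ID and the filtering rules below are placeholders):

```typescript
// Minimal sketch: let Google do the crawling and only apply your own rules
// to the returned results.
const API_KEY = 'YOUR_API_KEY';
const CX = 'YOUR_SEARCH_ENGINE_ID';

async function searchAndFilter(query: string) {
  const params = new URLSearchParams({ key: API_KEY, cx: CX, q: query });
  const res = await fetch(`https://www.googleapis.com/customsearch/v1?${params}`);
  const json = await res.json();

  // Apply the user's rule to the snippets Google returns.
  return (json.items ?? []).filter(
    (item: { title: string; link: string; snippet: string }) =>
      /\b1996\b/.test(item.snippet) && /\$2,?\d{3}\b/.test(item.snippet),
  );
}

searchAndFilter('honda accord 1996 site:craigslist.org')
  .then((hits) => console.log(hits))
  .catch(console.error);
```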
