Proper crawler architecture: how can I use the Elasticsearch ecosystem?

In v1.0 of a .NET data crawler, I created a Windows service that read URLs from a database and, based on some logic, selected what to crawl at a specified interval.
This was single-threaded and worked well for a low number of endpoints, but scaling is obviously an issue.
I'm trying to find out how to do this using the Elasticsearch (ELK) stack and came across Httpbeat, a Beat that polls HTTP endpoints at a regular interval and ships the results to the configured output channel, e.g. Logstash or Elasticsearch.
Looking at the documentation, you have to add URLs to the config.yaml file. That's not what I'm looking for, since the list of URLs can change and we may not want all URLs crawled at the same time.
Then there's RSS for Logstash, which is a command-line tool - again, not what I'm looking for.
Is there a way to make use of the Beats daemon to read from the Elasticsearch database and do work based on database values - crawls, etc.?
To take this to the enterprise level, do Beats or any other components of the Elasticsearch ecosystem use message queuing or a spooler (like Filebeat does - is this built into all Beats?)?
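For reference, Httpbeat's polling targets are declared statically in its YAML configuration, roughly like the sketch below (the exact field names are approximate, based on my reading of the Httpbeat docs; the URL, interval, and output host are placeholders). This illustrates the limitation described above: every endpoint must be listed ahead of time.

```yaml
# Illustrative Httpbeat-style configuration sketch: every endpoint
# to poll has to be listed statically in the YAML file.
httpbeat:
  hosts:
    # One entry per endpoint; adding or removing a URL means
    # editing this file and restarting the Beat.
    - url: "http://example.com/api/status"
      method: get
      period: 30s

output.elasticsearch:
  hosts: ["localhost:9200"]
```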

Related

Should I use Java or Logstash to index DB content in Elasticsearch?

I'm building a simple web application that will list/search retail items for sale.
The design is like this:
MySQL database -> Elasticsearch deployment -> Spring Boot REST service -> Web UI (JSP/Bootstrap or Angular)
I am planning to write Java client code to read the database and post records to Elasticsearch for indexing.
From Googling, it looks like Logstash is used for this sort of thing. I'm not familiar with Logstash, but I am very familiar with Java.
QUESTION: Is a Java client considered a "deprecated" or "legacy" way to submit data to Elasticsearch for indexing?
Should I use Java or Logstash?
Adding to Chris's answer: Logstash adds complexity and another piece of infrastructure to maintain in your stack, and Logstash is known for getting stuck and is not as resilient as Elasticsearch itself.
You are already using Java for your application code, and Elasticsearch now officially has a Java client, the Java High Level REST Client (JHLRC), which is very popular and provides an exhaustive list of APIs for indexing, searching, and building a modern search system.
IMHO you should use the JHLRC, which will spare you the pain points of Logstash:
you don't have to learn another tool
simpler infrastructure
simpler deployment
and, last but not least, a simpler and easier-to-maintain codebase.
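A minimal indexing sketch with the JHLRC might look like the following. The index name ("items"), document id, field names, and host are placeholders, and it assumes the elasticsearch-rest-high-level-client dependency (7.x) is on the classpath and a cluster is reachable; treat it as a sketch, not a drop-in implementation.

```java
import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ItemIndexer {
    public static void main(String[] args) throws Exception {
        // Connect to a single-node cluster (placeholder host/port).
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Index one retail item into a hypothetical "items" index.
            IndexRequest request = new IndexRequest("items")
                    .id("1")
                    .source(Map.of(
                            "name", "blue widget",
                            "price", 9.99));

            IndexResponse response = client.index(request, RequestOptions.DEFAULT);
            System.out.println(response.getResult());
        }
    }
}
```

In practice you would loop over the database rows read via JDBC and build one IndexRequest per record (or use the client's BulkRequest for batching).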
Logstash is a good tool for migrating data from many sources into Elasticsearch. It is itself built in Java, by the way.
You can use Logstash. It also has options to mutate or filter the data. It's a ready-to-use tool that will save a lot of your development time and effort.
But if you require a lot of customisation and need a lot of control over your data before pushing it to Elasticsearch, then you can build your own application for the same purpose.
Coming back to your question: Java is not deprecated for indexing data into Elasticsearch. It is still a preferred option.
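As an illustration of the mutate/filter options mentioned above, a Logstash pipeline can reshape records in flight. This fragment is a hedged sketch with placeholder field names (item_name, internal_id, stock):

```
filter {
  # Rename a database column and drop an internal-only field
  mutate {
    rename => { "item_name" => "name" }
    remove_field => ["internal_id"]
  }
  # Keep only items that are in stock
  if [stock] == 0 {
    drop { }
  }
}
```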

What is the use of Filebeat when parsing logs in Elasticsearch?

I don't understand why Filebeat is required when we already have Logstash.
With Filebeat you can collect and forward log files from one or many remote servers.
There is also an option to add source-specific fields to your log entries.
You have several output options, such as Elasticsearch or Logstash, for further analysis/filtering/modification.
Just imagine 20 or 200 machines running services like databases, web servers, hosted applications, and containers - and now you need to collect all the logs...
With Logstash alone you'll be pretty limited in this scenario.
Beats are light-weight agents used primarily for forwarding events from multiple sources. Beats have a small footprint and use fewer system resources than Logstash.
Logstash has a larger footprint, but provides a broad array of input, filter, and output plugins for collecting, enriching, and transforming data from a variety of sources.
Please note, though, that Filebeat is also capable of handling parsing for most use cases by way of an Elasticsearch ingest node, as described in the ingest node documentation.
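To make the light-weight role concrete, a minimal filebeat.yml shipping logs straight to Elasticsearch, with source-specific fields and an ingest pipeline for parsing, could look like this sketch (the paths, field values, and pipeline name are placeholders):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log
    # Source-specific fields attached to every event from this input
    fields:
      service: myapp

output.elasticsearch:
  hosts: ["localhost:9200"]
  # Parse/enrich events server-side via an Elasticsearch ingest pipeline
  pipeline: myapp-logs
```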

What are the differences between Beats and the JDBC plugin?

I am a newbie in Elasticsearch's wonderful world, so please be indulgent.
I am thinking about an import and synchronisation strategy for a Microsoft SQL data source, and if I haven't misunderstood, I can use either the JDBC input plugin or Beats.
But I don't see the deep differences between them:
What are they useful for? When should I use one or the other?
What are their benefits and drawbacks?
Thank you if you can help me.
They serve different purposes. Beats is another offering of the Elastic Stack: essentially a platform for collecting and shipping data (logs, network packets, all kinds of metrics, protocol data, etc.) from the periphery of your architecture. Even though Beats also lets you listen on the MySQL protocol and collect all kinds of metrics from your DB, it has nothing to do with reading data from your DB and loading it into Elasticsearch. For that you can use the jdbc input plugin, whose job is mainly to run a given query at regular intervals and send each retrieved DB record as an event through the Logstash pipeline, to be processed further and sent to a variety of different outputs.
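A jdbc input pipeline for a Microsoft SQL source could be sketched like this (the connection string, credentials, query, index name, and schedule are placeholders, and it assumes the Microsoft JDBC driver jar is available at the given path):

```
input {
  jdbc {
    jdbc_driver_library => "/path/to/mssql-jdbc.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=shop"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    # Run the query every five minutes; :sql_last_value tracks the
    # previous run so only new/changed rows are fetched
    schedule => "*/5 * * * *"
    statement => "SELECT * FROM items WHERE updated_at > :sql_last_value"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "items"
    # Reuse the DB primary key so re-runs update rather than duplicate
    document_id => "%{id}"
  }
}
```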

Elasticsearch: security concerns

We are using Elasticsearch as the back end for our in-house logging and monitoring system. We have multiple sites pouring data into one ES cluster, but into different indices; e.g. abc-us has data from the US site, abc-india from the India site.
Our concern is that we need some security checks before pushing data into the cluster:
data coming into an index comes from the right IP address
the incoming JSON request inserts new data only, and does not delete/update
when reading, certain IPs should not be able to read data from other indices
Kindly let me know if it's possible to achieve this using Elasticsearch.
The elasticsearch-jetty plugin brings the full power of Jetty and adds several new features to Elasticsearch. With this plugin, Elasticsearch can handle SSL connections, support basic authentication, and log all or some incoming requests in plain-text or JSON format.
The idea is to add a Jetty wrapper to Elasticsearch as a plugin.
What remains is only to restrict certain URLs and certain methods (e.g. DELETE) to certain users.
You can find elasticsearch-jetty on GitHub, with a detailed specification of its usage, configuration, and of course its limitations.

Understanding Elasticsearch deployment in 2 server load balanced setup

We have a two-server, load-balanced ASP.NET MVC application that we're looking to add search to - we like the look of Elasticsearch!
I'd like to build in a bit of resiliency, so I am considering installing ES on both servers and having them work as a cluster (which seems straightforward enough according to the docs).
But I really want my MVC app to talk to "the cluster", not to each specific ES node, so that if an ES node or a server fails, the application is unaffected:
Do I need to refer to the cluster in a special way from my application, or set up some kind of internal domain for it?
I presume I don't want to refer to "localhost:9200" or a server-specific IP in my application?
You should consider using NEST, the Elasticsearch .NET client. This client has built-in support for failover connections when accessing an Elasticsearch cluster.
If you want a failover client, then instead of passing a Uri, pass an IConnectionPool; see the Elasticsearch.Net documentation on cluster failover. All of its implementations can also be used with NEST.
