Nutch as stand-by spider with custom processing pipelines - hadoop

I would like to use Apache Nutch as a spider which only fetches given url list (no crawling). The urls are going to be stored in Redis and I want Nutch to take constantly pop them from the list and fetch html. The spider needs to be in stand-by mode - it always waits for the new urls coming into Redis until the user decides to stop the job. Also, I would like to apply my own processing pipelines to the extracted html files (not only text extraction). Is it possible to do with Nutch?

StormCrawler would be a much better fit for achieving this - it was designed to be able to cater for scenarios like the one you described. You'd need to write a custom spout t connect to redis, reuse the fetcher and parser bolts then add bolts with your own processing. Some of SC's early users were doing exactly that

Related

Flink web UI: Monitor Metrics doesn't work

run with flink-1.9.0 on yarn(2.6.0-cdh5.11.1), but the flink web ui metrics does'nt work, as shown below:
I guess you are looking at the wrong metrics. Due no data flows from one task to another (you can see only one box at the UI) there is nothing to show. The metrics you are looking at only show the data which flows from one flink task to another. At your example everything happens within this task.
Look at this example:
You can see two tasks sending data to the map-task which emits this data to another task. Therefore you see incoming and outgoing data.
But on the other hand a source task never has incoming data(I must admit that this is confusing at the first look):
The number of records recieved is 0 but it send a couple of records to the downstream task.
Back to your problem: What you can do is have a look at the operator metrics. If you look at the metrics tab (the one at the very right) you can select beside the task metrics also some operator metrics. These metrics have a name like 0.Map.numRecordsIn.
The name assembles like this <slot>.<operatorName>.<metricsname>. But be aware that this metrics are not recorded, you don't have any historic data and once you leave this tab or remove a metric the data collected until that point are gone. I would recommend to use a proper metrics backend like influx, prometheus or graphite. You can find a description at the flink docs.
Hope that helped.

Nifi processor to route flows based on changeable list of regex

I am trying to use Nifi to act as a router for syslog based on a list of regexes matching the syslog.body (nb as this is just a proof of concept I can change any part if needed)
The thought process is that via a separate system (for now, vi and a text file 😃) an admin can define a list of criteria (regex format for each seems sensible) which, if matched, would result in syslog messages being sent to a specific separate system (for example, all critical audit data (matched by the regex list) is sent to the audit system and all other data goes to the standard log store
I know that this can be done on Route by content processors but the properties are configured before the processor starts and an admin would have to stop the processor every time they need to make an edit
I would like to load the list of regex in periodically (automatically) and have the processor properties be updated
I don’t mind if this is done all natively in Nifi (but that is preferable for elegance and to save an external app being written) or via a REST API call driven by a python script or something (or can Nifi send REST calls to itself?!)
I appreciate a processor property cannot be updated while running, so it would have to be stopped to be updated, but that’s fine as the queue will buffer for the brief period. Maybe a check to see if the file has changed could avoid outages for no reason rather than periodic update regardless, I can solve that problem later.
Thanks
Chris
I think the easiest solution would be to use ScanContent, a processor which specifies a dictionary file on disk which contains a list of search terms and monitors the file for changes, reloading in that event. The processor then applies the search terms to the content of incoming flowfiles and allows you to route them based on matches. While this processor doesn't support regular expressions as dictionary terms, you could make a slight modification to the code or use this as a baseline for a custom processor with those changes.
If that doesn't work for you, there are a number of LookupService implementations which show how CSV, XML, property files, etc. can be monitored and read by the controller framework to provide an updated mapping of key/value pairs. These can also serve as a foundation for building a more complicated scan/match flow using the loaded terms/patterns.
Finally, if you have to rely on direct processor property updating, you can script this with the NiFi API calls to stop, update, and restart the processors so it can be done in near-real-time. To determine these APIs, visit the API documentation or execute the desired tasks via the UI in your browser and use the Developer Tools to capture the HTTP requests being made.

Distributed crawling and rate limiting / flow control

I am running a niche search product that works with a web crawler. The current crawler is a single (PHP Laravel) worker crawling the urls and putting the results into an Elastic Search engine. The system continuously keeps re-crawling the found url's with a interval of X milliseconds.
This has served me well but with some new large clients coming up the crawler is going to hit it's limits. I need to redesign the system to a distributed crawler to speed up the crawling. The problem is the combination of specs below.
The system must adhere to the following 2 rules:
multiple workers (concurrency issues)
variable rate-limit per client. I need to be very sure the system doesn't crawl client X more then once every X milliseconds.
What i have tried:
I tried putting the url's in a MySQL table and let the workers query for a url to crawl based on last_crawled_at timestamps in the clients and urls table. But MySQL doesn't like multiple concurrent workers and i receive all sorts of deadlocks.
I tried putting the url's into a Redis engine. I got this kinda working, but only with a Lua script that checks and sets an expiring key for every client that is being served. This all feels way to hackish.
I thought about filling a regular queue but this will violate rule number 2 as i can't be 100% sure the workers can process the queue 'real-time'.
Can anybody explain me how the big boys do this? How can we have multiple processes query a big/massive list of url's based on a few criteria (like rate limiting the client) and make sure we hand out the the url to only 1 worker?
Ideally we won't need another database besides Elastic with all the available / found urls's but i don't think that's possible?
Have a look at StormCrawler, it is a distributed web crawler with has an Elasticsearch module. It is highly customisable and enforces politeness by respecting robots.txt and having by default a single thread per host or domain.

Using NiFi for scheduling Hadoop batch processes

According to NiFi's homepage, it "supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic".
I've been playing with NiFi in the last couple of months and can't help wondering why not using it also for scheduling batch processes.
Let's say I have a use case in which data flows into Hadoop, processed by a series of Hive \ MapReduce jobs, then exported to some external NoSql database to be used by some system.
Using NiFi in order to ingest and flow data into Hadoop is a use case that NiFi was built for.
However, using Nifi in order to schedule the jobs on Hadoop ("Oozie-like") is a use case I haven't encountered others implementing, and since it seems completely possible to implement, I'm trying to understand if there are reasons not to do so.
The gains of doing it all on NiFi is that one will get a visual representation of the entire course of data, from source to destination in one place. In case of complicated flows it is very important for maintenance.
In other words - my question is: Are there reasons not to use NiFi as a scheduler\coordinator for batch processes? If so - what problems may arise in such a use case?
PS - I've read this: "Is Nifi having batch processing?" - but my question aims to a different sense of "batch processing in NiFi" than the one raised in the attached question
You are correct that you would have a concise visual representation of all steps in the process were the schedule triggers present on the flow canvas, but NiFi was not designed as a scheduler/coordinator. Here is a comparison of some scheduler options.
Using NiFi to control scheduling feels like a "hammer" solution in search of a problem. It will reduce the ease of defining those schedules programmatically or interacting with them from external tools. Theoretically, you could define the schedule format, read them into NiFi from a file, datasource, endpoint, etc., and use ExecuteStreamCommand, ExecuteScript, or InvokeHTTP processors to kick off the batch process. This feels like introducing an unnecessary intermediate step however. If consolidation & visualization is what you are going for, you could have a monitoring flow segment ingest those schedule definitions from their native format (Oozie, XML, etc.) and display them in NiFi without making NiFi responsible for defining and executing the scheduling.

Running web-fetches from within a Hadoop cluster

A blog post - http://petewarden.typepad.com/searchbrowser/2011/05/using-hadoop-with-external-api-calls.html - suggests calling external systems (querying the twitter API, or crawling webpages) from within a Hadoop cluster.
For the system I'm currently developing, there are both fast, and slow(bulk) sub-systems. Data is fetched from Twitter's API -also for quick, individual retrievals. This can be hundreds of thousands (even millions) of external requests per day. The content of web pages are also retrieved for further processing - with at least the same scale of requests.
Aside from potential side-effects to the external source (changing data so it's different on the next request), what would be the pluses, or minuses of using Hadoop in such a way? Is it a valid and useful method of bulk, and/or fast retrieval of data?
The plus: it's a super easy way to distribute the work that needs to be done.
The minus: due to the way that Hadoop recovers from failures, you need to be very careful about managing what is and isn't run (which you can definitely do, it's just something to watch out for). If a reduce fails, for example, then all of the map jobs that feed that partition must also be rerun. Obviously this would most likely be a no-reducer job, but this is still true of mappers...what happens if half of the calls run, then the job fails, so it is rescheduled?
You could use some sort of high-throughput system to manage the calls that are actually made or somesuch. But it definitely can be appropriately used for this.

Resources