what is going on inside of Nutch 2? - algorithm

I am eager to know (and have to know, because it relates to my project) about Nutch and the algorithms it uses to fetch, classify, ... (generally, to crawl).
I read this material, but it's a little hard to understand.
Is there anyone who can explain this to me in a complete and easy-to-understand way?
thanks in advance.

Short Answer
In short, they have developed a web crawler designed to crawl the web very efficiently across many computers (but which can also be run on a single computer).
You can start crawling the web without actually needing to know how they implemented it.
The page you reference describes how it is implemented.
Technology behind it
They make use of Hadoop, an open-source Java project that implements the MapReduce model. MapReduce is the model Google developed for processing web-scale data, such as building its search index.
I've attended a few lectures on MapReduce/Hadoop, and unfortunately, I don't know if anyone at this time can explain it in a complete and easy-to-understand way (they're kind of opposites).
Take a look at the wikipedia page for MapReduce.
The basic idea is to send a job to the Master Node. The Master breaks the work up into pieces and sends them (maps them) to the various Worker Nodes (other computers or threads), which perform their assigned sub-tasks and then send the sub-results back to the Master.
Once the Master Node gets all the sub-results (or some of the sub-results) it starts to combine them (reduce them) into the final answer.
All of these tasks are done at the same time, and each computer is given the right amount of work to keep it occupied the whole time.
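The master/worker idea above can be sketched in a few lines of Go. This is only a single-process illustration, not how Hadoop is implemented: goroutines stand in for Worker Nodes, and the function names (mapWords, reduceCounts, wordCount) are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// mapWords is the "map" step: one worker counts the words in its own chunk.
func mapWords(chunk string) map[string]int {
	counts := make(map[string]int)
	for _, w := range strings.Fields(chunk) {
		counts[strings.ToLower(w)]++
	}
	return counts
}

// reduceCounts is the "reduce" step: merge the partial results into one answer.
func reduceCounts(partials []map[string]int) map[string]int {
	total := make(map[string]int)
	for _, p := range partials {
		for w, n := range p {
			total[w] += n
		}
	}
	return total
}

// wordCount plays the master: split the job, run the workers in parallel, combine.
func wordCount(chunks []string) map[string]int {
	partials := make([]map[string]int, len(chunks))
	var wg sync.WaitGroup
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c string) {
			defer wg.Done()
			partials[i] = mapWords(c) // each worker writes only its own slot
		}(i, c)
	}
	wg.Wait()
	return reduceCounts(partials)
}

func main() {
	fmt.Println(wordCount([]string{"the web is big", "the crawler reads the web"}))
}
```

A real cluster adds what this sketch leaves out: shipping data between machines, retrying failed workers, and balancing the chunks.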
How to Crawl
Consists of 4 jobs:
Generate
Fetch
Parse
Update Database
*Generate
Start with a list of webpages containing the pages you want to start crawling from: The "Webtable".
The Master node sends all of the pages in that list to its slaves (but if two pages have the same domain they are sent to the same slave).
The Slave takes its assigned webpage(s) and:
Has this already been generated? If so, skip it.
Normalize the URL, since "http://www.google.com/" and "http://www.google.com/../" are actually the same webpage.
return an initial score along with the webpage back to the Master.
(the Master partitions the webpages when it sends it to its slaves so that they all finish at the same time)
The Master now chooses the topN (maybe the user just wanted to start with 10 initial pages), and marks them as chosen in the webtable.
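A rough Go sketch of the three ideas in the Generate step: normalizing URLs, sending all URLs of one host to the same worker, and picking the topN by score. None of this is Nutch code. The names normalize, partitionFor, candidate and topN are invented for the example, the hash is an arbitrary choice, and it groups by host where Nutch may group by domain.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/url"
	"path"
	"sort"
	"strings"
)

// normalize lower-cases the host, resolves "." and ".." path segments
// and drops the fragment, so equivalent URLs compare equal.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	clean := path.Clean("/" + u.Path)
	if strings.HasSuffix(u.Path, "/") && clean != "/" {
		clean += "/" // path.Clean strips the trailing slash; put it back
	}
	u.Path, u.RawPath = clean, ""
	return u.String(), nil
}

// partitionFor sends every URL of one host to the same worker.
func partitionFor(normalized string, workers int) int {
	u, err := url.Parse(normalized)
	if err != nil {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(u.Hostname()))
	return int(h.Sum32() % uint32(workers))
}

type candidate struct {
	URL   string
	Score float64
}

// topN returns the n highest-scoring candidates without modifying the input.
func topN(cands []candidate, n int) []candidate {
	sorted := append([]candidate(nil), cands...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Score > sorted[j].Score })
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	a, _ := normalize("http://www.google.com/")
	b, _ := normalize("http://WWW.Google.com/../")
	fmt.Println(a == b, partitionFor(a, 4) == partitionFor(b, 4))
	fmt.Println(topN([]candidate{{"http://a.example/", 0.2}, {"http://b.example/", 0.9}}, 1))
}
```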
*Fetch
Master looks at each URL in the webtable, maps the ones which were marked onto slaves to process them.
Slaves fetch each URL from the Internet as fast as the internet connection will let them. They keep a separate queue for each domain.
They return the URL along with the HTML text of the webpage to the Master.
*Parse
Master looks at each webpage in the webtable. If it is marked as fetched, the Master sends it to its slaves to be parsed.
The slave first checks whether the page was already parsed by a different slave; if so, it skips it.
Otherwise, it parses the webpage and saves the result to webtable.
*Update Database
Master looks at each webpage in the webtable, sends the parsed rows to its slaves.
The slaves receive these parsed URLs, calculate a score for them based on the number of links leading away from those pages (and the text near those links), and send the URLs and scores back to the Master (sorted by score by the time they arrive, because of the partitioning).
The master calculates and updates the webpage scores based on the number of links to those pages from other ones.
The master stores this all to the database.
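As a toy version of "score a page by the number of links pointing to it", here is a plain inlink count in Go. This is a simplification chosen for illustration, not Nutch's actual scoring, and the outlinks map stands in for the parsed rows of the webtable.

```go
package main

import "fmt"

// inlinkScores counts, for every page, how many distinct other pages link to it.
// outlinks maps a page URL to the URLs found on that page.
func inlinkScores(outlinks map[string][]string) map[string]int {
	scores := make(map[string]int)
	for from, targets := range outlinks {
		seen := make(map[string]bool)
		for _, to := range targets {
			if to == from || seen[to] {
				continue // ignore self-links and repeated links on one page
			}
			seen[to] = true
			scores[to]++
		}
	}
	return scores
}

func main() {
	scores := inlinkScores(map[string][]string{
		"http://a.example/": {"http://b.example/", "http://c.example/"},
		"http://b.example/": {"http://c.example/", "http://c.example/"},
	})
	fmt.Println(scores)
}
```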
Repeat
When the pages were parsed, the links out of those webpages were added into the webtable. You can now repeat this process on just pages you haven't looked at yet to keep expanding your visited pages. Eventually you will reach most of the Internet after enough iterations of the four above steps.
Conclusion
MapReduce is a cool system.
A lot of effort has been applied to make it as efficient as possible.
It can handle computers breaking down in the middle of a job by reassigning their work to other slaves. It can also handle some slaves being faster than others.
The Master may decide to do the slaves' tasks on its own machine instead of sending it out to a slave if it will be more efficient. The communication network is incredibly advanced.
MapReduce lets you write simple code:
Define a Mapper, an optional Partitioner, and a Reducer.
Then let MapReduce figure out how best to do that with all the computer resources it has access to, whether that is a single computer with a slow internet connection or a kilo-cluster (maybe even a mega-cluster).

Related

Scheduling tasks/messages for later processing/delivery

I'm creating a new service, and for that I have database entries (Mongo) with a state field that I need to update based on the current time. For instance, if the start time was set to two hours from now, I need to change the state from CREATED -> STARTED in the database at that time, and there can be multiple such states.
Approaches I've thought of:
Keep querying database entries that are <= current time and then change their states accordingly. This causes extra reads for no reason and half the time empty reads, and it will get complicated fast with more states coming in.
I write a job scheduler (I am using go, so that'd be not so hard), and schedule all the jobs, but I might lose queue data in case of a panic/crash.
I use some products like celery, have found a go implementation for it https://github.com/gocelery/gocelery
Another task scheduler I've found is on Google Cloud https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine, but I don't want to get stuck in proprietary technologies.
I wanted to use some PubSub service for this, but I couldn't find one that has delayed messages (if that's a thing). My problem is mainly not being able to find an actual name for this problem, to be able to search for it properly, I've even tried searching Microsoft docs. If someone can point me in the right direction or if any of the approaches I've written are the ones I should use, please let me know, that would be a great help!
UPDATE:
Found one more solution by Netflix, for the same problem
https://medium.com/netflix-techblog/distributed-delay-queues-based-on-dynomite-6b31eca37fbc
I think you are right in that the problem you are trying to solve is the job or task scheduling problem.
One approach that many companies use is the system you are proposing: jobs are inserted into a datastore with a time to execute at and then that datastore can be polled for jobs to be run. There are optimizations that prevent extra reads like polling the database at a regular interval and using exponential back-off. The advantage of this system is that it is tolerant to node failure and the disadvantage is added complexity to the system.
Looking around, in addition to the one you linked (https://github.com/gocelery/gocelery) there are other implementations of this model (https://github.com/ajvb/kala or https://github.com/rakanalh/scheduler were ones I found after a quick search).
The other approach you described, "schedule jobs in process", is very easy in Go because parked goroutines are extremely cheap: you can just spawn one goroutine per job. The downside is that if the process dies, the job is lost.
go func() {
	// Park this goroutine until the scheduled time.
	// time.Until(t) is shorthand for t.Sub(time.Now()).
	<-time.After(time.Until(expirationTime))
	// do work here.
}()
A final approach that I have seen but wouldn't recommend is the callback model (something like https://gitlab.com/andreynech/dsched). This is where your service calls to another service (over http, grpc, etc.) and schedules a callback for a specific time. The advantage is that if you have multiple services in different languages, they can use the same scheduler.
Overall, before you decide on a solution, I would consider some trade-offs:
How acceptable is job loss? If it's ok that some jobs are lost a small percentage of the time, maybe an in-process solution is acceptable.
How long will jobs be waiting? If it's longer than the shutdown period of your host, maybe a datastore based solution is better.
Will you need to distribute job load across multiple machines? If you need to distribute the load, sharding and scheduling are tricky things and you might want to consider using a more off-the-shelf solution.
Good luck! Hope that helps.

Distributed crawling and rate limiting / flow control

I am running a niche search product that works with a web crawler. The current crawler is a single (PHP Laravel) worker crawling the URLs and putting the results into an Elasticsearch engine. The system continuously keeps re-crawling the found URLs with an interval of X milliseconds.
This has served me well, but with some new large clients coming up the crawler is going to hit its limits. I need to redesign the system into a distributed crawler to speed up the crawling. The problem is the combination of specs below.
The system must adhere to the following 2 rules:
multiple workers (concurrency issues)
variable rate-limit per client. I need to be very sure the system doesn't crawl client X more than once every X milliseconds.
What I have tried:
I tried putting the URLs in a MySQL table and letting the workers query for a URL to crawl based on last_crawled_at timestamps in the clients and urls tables. But MySQL doesn't like multiple concurrent workers and I receive all sorts of deadlocks.
I tried putting the URLs into a Redis engine. I got this kind of working, but only with a Lua script that checks and sets an expiring key for every client that is being served. This all feels way too hackish.
I thought about filling a regular queue, but this would violate rule number 2, as I can't be 100% sure the workers can process the queue in 'real time'.
Can anybody explain to me how the big boys do this? How can we have multiple processes query a big/massive list of URLs based on a few criteria (like rate limiting the client) and make sure we hand out each URL to only 1 worker?
Ideally we won't need another database besides Elastic with all the available / found URLs, but I don't think that's possible?
Have a look at StormCrawler. It is a distributed web crawler which has an Elasticsearch module. It is highly customisable and enforces politeness by respecting robots.txt and by having, by default, a single thread per host or domain.

Eventual consistency - how to avoid phantoms

I am new to the topic. Having read a handful of articles on it, and asked a couple of persons, I still do not understand what you people do in regard to one problem.
There are UI clients making requests to several backend instances (for now it's irrelevant whether sessions are sticky or not), and those instances are connected to some highly available DB cluster (be it Cassandra or something else, or even Elasticsearch). Say the backend instance is not specifically tied to one of the cluster's machines, and instead its every request to the DB may be served by a different machine.
One client creates some record; it's synchronously or asynchronously stored to one of the cluster's machines, then eventually gets replicated to the rest of the DB machines. Then another client requests the list of records, the request ends up served by a distant machine that has not yet received the replicated changes, and so the client does not see the record. Well, that's bad but not yet ugly.
Consider however that the second client hits the machine which has the record, displays it in a list, then refreshes the list and this time hits the distant machine and again does not see the record. That's very weird behavior to observe, isn't it? It might even get worse: the client successfully requests the record, starts some editing on it, then tries to store the updates to DB and this time hits the distant machine which says "I know nothing about this record you are trying to update". That's an error which the user will see while doing something completely legitimate.
So what's the common practice to guard against this?
So far, I only see three solutions.
1) Not actually a solution but rather a policy: ignore the problem and instead speed up the cluster hard enough to guarantee that 99.999% of changes will be replicated on the whole cluster in, say, 0.5 second (it's hard to imagine some user will try to make several consecutive requests to one record in that time; he can of course issue several reading requests, but in that case he'll probably not notice inconsistency between results). And even if sometimes something goes wrong and the user faces the problem, well, we just embrace that. If the loser gets unhappy and writes a complaint to us (which will happen maybe once a week or once an hour), we just apologize and go on.
2) Introduce an affinity between user's session and a specific DB machine. This helps, but needs explicit support from the DB, and also hurts load-balancing, and invites complications when the DB machine goes down and the session needs to be re-bound to another machine (however with proper support from DB I think that's possible; say Elasticsearch can accept routing key, and I believe if the target shard goes down it will just switch the affinity link to another shard - though I am not entirely sure; but even if re-binding happens, the other machine may contain older data :) ).
3) Rely on monotonic consistency, i.e. some method to be sure that the next request from a client will get results no older than the previous one. But, as I understand it, this approach also requires explicit support from the DB, like being able to pass some "global version timestamp" to the cluster's balancer, which it will compare with its latest data on all machines' timestamps to determine which machines can serve the request.
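To illustrate option 3, here is a toy Go model of a session that remembers the newest version it has seen and refuses to read from a replica that is behind it. The Replica and Session types and the single integer version are assumptions made for the sketch. A real database would have to expose something comparable.

```go
package main

import "fmt"

// Replica is a DB machine together with the newest version it has applied.
type Replica struct {
	Name    string
	Version int64
}

// Session remembers the newest version this client has already observed.
type Session struct {
	lastSeen int64
}

// Read picks the first replica that is at least as fresh as anything the
// session has seen before, and records the version it returns.
func (s *Session) Read(replicas []Replica) (Replica, bool) {
	for _, r := range replicas {
		if r.Version >= s.lastSeen {
			s.lastSeen = r.Version
			return r, true
		}
	}
	return Replica{}, false // nobody is fresh enough: retry or wait
}

func main() {
	s := &Session{}
	first, _ := s.Read([]Replica{{"near", 5}, {"distant", 3}})
	second, _ := s.Read([]Replica{{"distant", 3}, {"near", 5}})
	fmt.Println(first.Name, second.Name)
}
```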
Are there other good options? Or are those three considered good enough to use?
P.S. My specific problem right now is with Elasticsearch; AFAIK there is no support for monotonic reads there, though looks like option #2 may be available.
Apache Ignite has a primary partition for a key, plus backup partitions. Unless you have the readFromBackup option set, you will always be reading from the primary partition, whose contents are expected to be reliable.
If a node goes away, a transaction (or operation) should be either propagated by remaining nodes or rolled back.
Note that Apache Ignite doesn't do Eventual Consistency but instead Strong Consistency. It means that you can observe delays during node loss, but will not observe inconsistent data.
In Cassandra, if you use at least quorum consistency for both reads and writes, you will get monotonic reads. This was not the case pre-1.0, but that's a long time ago. There are some gotchas if using server timestamps, but that's not the default, so it likely won't be an issue if using C* 2.1+.
What can get funny, since C* uses timestamps, is things that occur at the "same time". Since Cassandra is Last Write Wins, the times and clock drift do matter. But concurrent updates to records will always have race conditions, so if you require strong read-before-write guarantees you can use lightweight transactions (essentially CAS operations using Paxos) to ensure no one else updates between your read and your update. These are slow, though, so I would avoid them unless critical.
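The reason quorum reads plus quorum writes work is simple arithmetic: if the number of replicas read (R) plus the number written (W) exceeds the replication factor (N), every read set overlaps every write set in at least one replica. A small Go check of that rule (the function names are mine):

```go
package main

import "fmt"

// quorum is the smallest majority of n replicas.
func quorum(n int) int { return n/2 + 1 }

// overlaps reports whether every read of r replicas must include at least
// one replica touched by every write of w replicas, out of n in total.
func overlaps(n, r, w int) bool { return r+w > n }

func main() {
	n := 3
	q := quorum(n)
	fmt.Println(q, overlaps(n, q, q)) // quorum reads and quorum writes
	fmt.Println(overlaps(n, 1, 1))    // reading and writing a single replica
}
```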
In a true distributed system, it does not matter where your record is stored in the remote cluster as long as your clients are connected to that cluster. In Hazelcast, a record is always stored in a partition, and each partition is owned by one of the servers in the cluster. There could be X number of partitions in the cluster (271 by default), and all those partitions are equally distributed across the cluster. So a 3-member cluster will have a partition distribution like 91-90-90.
Now when a client sends a record to store in the Hazelcast cluster, it already knows which partition the record belongs to by using a consistent hashing algorithm. With that, it also knows which server is the owner of that partition. Hence, the client sends its operation directly to that server. This approach applies to all client operations, put or get. So in your case, you may have several UI clients connected to the cluster, but your record for a particular user is stored on one server in the cluster and all your UI clients will be approaching that server for their operations related to that record.
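A back-of-the-envelope Go sketch of that routing. It is not Hazelcast's real algorithm: the FNV hash and the modulo assignment of partitions to members are stand-ins picked for the example. It does reproduce the 91-90-90 split for 271 partitions on 3 members.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const partitionCount = 271 // the default mentioned above

// partitionOf maps a key to a partition; every client computes the same answer.
func partitionOf(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % partitionCount)
}

// ownerOf assigns partitions to members round-robin (an assumption for this sketch).
func ownerOf(partition, members int) int {
	return partition % members
}

// distribution counts how many partitions each member owns.
func distribution(members int) []int {
	counts := make([]int, members)
	for p := 0; p < partitionCount; p++ {
		counts[ownerOf(p, members)]++
	}
	return counts
}

func main() {
	p := partitionOf("user-42")
	fmt.Println("partition", p, "owned by member", ownerOf(p, 3))
	fmt.Println(distribution(3))
}
```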
As for consistency, Hazelcast by default is strongly consistent distributed cache, which implies that all your updates to a particular record happen synchronously, in the same thread and the application waits until it has received acknowledgement from the owner server (and the backup server if backups are enabled) in the cluster.
When you connect a DB layer (this could be one or many different types of DBs running in parallel) to the cluster, the Hazelcast cluster returns data even if it's not currently present in the cluster, by reading it from the DB. So you never get a null value. On updating, you configure the cluster to send the updates downstream synchronously or asynchronously.
Ah-ha, after some even more thorough study of ES discussions I found this: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html
Note how they specifically highlight the "custom value" case, recommending to use it exactly to solve my problem.
So, given that's their official recommendation, we can summarise it like this.
To fight volatile reads, we are supposed to use "preference", with "custom" or some other approach.
To also get "read your writes" consistency, we can have all clients use "preference=_primary", because the primary shard is first to get all writes. This however will probably have worse performance than "custom" mode due to no distribution. And that's quite similar to what other people here said about Ignite and Hazelcast.
Right?
Of course that's a solution specifically for ES. Reverting to my initial question which is a bit more generic, turns out that options #2 and #3 are really considered good enough for many distributed systems, with #3 being possible to achieve with #2 (even without immediate support for #3 by DB).
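For the "custom value" case, the idea is to send the same preference string (for example the user's session ID) with every search of that session, so those searches keep going to the same shard copies. Below is a minimal Go helper that only builds such a URL. The base address, index name and session ID are placeholders, and as far as I recall the custom string must not start with an underscore.

```go
package main

import (
	"fmt"
	"net/url"
)

// searchURL builds a search URL that pins a session to the same shard copies
// by passing the session ID as the custom "preference" value.
func searchURL(base, index, sessionID string) string {
	q := url.Values{}
	q.Set("preference", sessionID)
	return base + "/" + url.PathEscape(index) + "/_search?" + q.Encode()
}

func main() {
	// Hypothetical host, index and session ID.
	fmt.Println(searchURL("http://localhost:9200", "records", "session-42"))
}
```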

Detecting and recovering failed H2 cluster nodes

After going through the H2 developer guide, I still don't understand how I can find out which cluster node(s) failed and which database needs to be recovered in the event of a temporary network failure.
Let's consider the following scenario:
H2 cluster started with N active nodes (is it actually true that H2 can support N>2, i.e. more than 2 cluster nodes?)
(lots of DB updates, reads...)
Network connection with one (or several) cluster nodes goes down and the node becomes invisible to the rest of the cluster
(lots of DB updates, reads...)
Network link with previously disconnected node(s) restored
It is discovered that cluster node was probably missing (as far as I can see SELECT VALUE FROM INFORMATION_SCHEMA.SETTINGS WHERE NAME='CLUSTER' starts responding with empty string if one node in cluster fails)
After this point it is unclear how to find out what nodes were failing?
Obviously, I can do some basic check like comparing DB size, but it is unreliable.
What is the recommended procedure to find out what node was missing in the cluster, esp. if query above responds with empty string?
Another question: why doesn't urlTarget support multiple parameters?
How am I supposed to use the CreateCluster tool if multiple nodes in the cluster failed and I want to recover more than one?
Also, I don't understand how CreateCluster works if I had to stop the cluster and I don't actually want to recover any nodes. What's not clear to me is what I need to pass to the CreateCluster tool if I don't actually need to copy the database.
That is partially right: SELECT VALUE FROM INFORMATION_SCHEMA.SETTINGS WHERE NAME='CLUSTER' will return an empty string when queried in standard mode.
However, you can get the list of servers by using Connection.getClientInfo() as well, but it is a two-step process. Paraphrased from h2database.com:
The list of properties returned by getClientInfo() includes a numServers property that returns the number of servers that are in the connection list. getClientInfo() also has properties server0..serverN, where N is the number of servers - 1. So to get the 2nd server from the list you use getClientInfo("server1").
Note: The serverX property only returns IP addresses and ports and not hostnames.
And before you say simple replication, yes that is default operation, but you can do more advanced things that are outside the scope of your question in clustered H2.
Here's the quote for what you're talking about:
Clustering can only be used in the server mode (the embedded mode does not support clustering). The cluster can be re-created using the CreateCluster tool without stopping the remaining server. Applications that are still connected are automatically disconnected, however when appending ;AUTO_RECONNECT=TRUE, they will recover from that.
So yes, if the cluster stops, auto_reconnect is not enabled, and you stick with the basic query, you are stuck and it is difficult to find information. While most people will tell you to look through the API and/or the manual, they haven't had to look through this one, so, my sympathies.
I find it way more useful to track through the error codes, because you get a real good idea of what you can do when you see how the failure is planned for ... here you go.

Running web-fetches from within a Hadoop cluster

A blog post - http://petewarden.typepad.com/searchbrowser/2011/05/using-hadoop-with-external-api-calls.html - suggests calling external systems (querying the twitter API, or crawling webpages) from within a Hadoop cluster.
For the system I'm currently developing, there are both fast and slow (bulk) sub-systems. Data is fetched from Twitter's API, also for quick, individual retrievals. This can be hundreds of thousands (even millions) of external requests per day. The content of web pages is also retrieved for further processing, with at least the same scale of requests.
Aside from potential side-effects to the external source (changing data so it's different on the next request), what would be the pluses or minuses of using Hadoop in such a way? Is it a valid and useful method of bulk and/or fast retrieval of data?
The plus: it's a super easy way to distribute the work that needs to be done.
The minus: due to the way that Hadoop recovers from failures, you need to be very careful about managing what is and isn't run (which you can definitely do, it's just something to watch out for). If a reduce fails, for example, then all of the map jobs that feed that partition must also be rerun. Obviously this would most likely be a no-reducer job, but this is still true of mappers...what happens if half of the calls run, then the job fails, so it is rescheduled?
You could use some sort of high-throughput system to manage the calls that are actually made or somesuch. But it definitely can be appropriately used for this.
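One way to "manage what is and isn't run" is to record each completed external call under a stable key, so a rescheduled task can skip the calls that already succeeded. Here is a minimal in-memory Go sketch of that idea. The Ledger type is hypothetical, and since a rerun task may land on a different machine, a real version would need the ledger in a store all tasks share.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger remembers the result of every external call that has completed.
type Ledger struct {
	mu   sync.Mutex
	done map[string]string
}

func NewLedger() *Ledger {
	return &Ledger{done: make(map[string]string)}
}

// Do runs call for key only if no earlier attempt recorded a result for it.
// Holding the lock during the call keeps the sketch simple but serializes calls.
func (l *Ledger) Do(key string, call func() string) string {
	l.mu.Lock()
	defer l.mu.Unlock()
	if res, ok := l.done[key]; ok {
		return res
	}
	res := call()
	l.done[key] = res
	return res
}

func main() {
	ledger := NewLedger()
	calls := 0
	fetch := func() string { calls++; return "payload" } // stand-in for an API request
	ledger.Do("tweet:123", fetch)
	ledger.Do("tweet:123", fetch) // a rescheduled task asks for the same item again
	fmt.Println("external calls made:", calls)
}
```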

Resources