Distributed crawling and rate limiting / flow control - Elasticsearch

I am running a niche search product that works with a web crawler. The current crawler is a single (PHP Laravel) worker crawling the URLs and putting the results into an Elasticsearch index. The system continuously keeps re-crawling the found URLs with an interval of X milliseconds.
This has served me well, but with some new large clients coming up the crawler is going to hit its limits. I need to redesign the system as a distributed crawler to speed up the crawling. The problem is the combination of the specs below.
The system must adhere to the following 2 rules:
multiple workers (concurrency issues)
a variable rate limit per client. I need to be very sure the system doesn't crawl a given client more often than once every X milliseconds.
What I have tried:
I tried putting the URLs in a MySQL table and letting the workers query for a URL to crawl based on last_crawled_at timestamps in the clients and urls tables. But MySQL doesn't like multiple concurrent workers, and I received all sorts of deadlocks.
I tried putting the URLs into a Redis engine. I got this kinda working, but only with a Lua script that checks and sets an expiring key for every client that is being served. This all feels way too hackish.
I thought about filling a regular queue, but this would violate rule number 2, as I can't be 100% sure the workers can process the queue 'real-time'.
Can anybody explain to me how the big boys do this? How can we have multiple processes query a big/massive list of URLs based on a few criteria (like rate limiting the client) and make sure we hand out each URL to only one worker?
Ideally we wouldn't need another database besides Elasticsearch holding all the available/found URLs, but I don't think that's possible?

Have a look at StormCrawler; it is a distributed web crawler which has an Elasticsearch module. It is highly customisable and enforces politeness by respecting robots.txt and, by default, having a single thread per host or domain.
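If you would rather stay on Redis than adopt StormCrawler, the check-and-set you describe doesn't have to feel hackish: push the whole "give me a URL for a client that is not currently rate-limited" decision into a single atomic Lua call, so two workers can never claim the same client inside its window. A minimal sketch in Python with redis-py; the key names and data layout (a sorted set of URLs per client scored by next-due time, plus an expiring per-client cooldown key) are assumptions for illustration, not anything from your setup:

```python
# Sketch only: per-client crawl dispatch with an atomic rate-limit check in Redis.
# Assumed data layout (illustrative):
#   crawl:due:<client>       sorted set of URLs, scored by "next due" unix ms
#   crawl:cooldown:<client>  expiring key that exists while the client is rate-limited
import time
import redis

r = redis.Redis()

# The Lua script runs atomically, so two workers can never take a URL
# for the same client inside its rate-limit window.
DISPATCH = r.register_script("""
local due_key  = KEYS[1]        -- sorted set of URLs for this client
local cool_key = KEYS[2]        -- per-client cooldown key
local now_ms   = tonumber(ARGV[1])
local limit_ms = tonumber(ARGV[2])

if redis.call('EXISTS', cool_key) == 1 then
    return nil                   -- client still inside its rate-limit window
end

local urls = redis.call('ZRANGEBYSCORE', due_key, '-inf', now_ms, 'LIMIT', 0, 1)
if #urls == 0 then
    return nil                   -- nothing due for this client yet
end

-- Claim the URL: remove it and start the cooldown in one atomic step.
redis.call('ZREM', due_key, urls[1])
redis.call('SET', cool_key, '1', 'PX', limit_ms)
return urls[1]
""")

def next_url_for(client_id: str, rate_limit_ms: int):
    """Return a URL the calling worker may crawl now, or None."""
    now_ms = int(time.time() * 1000)
    return DISPATCH(keys=[f"crawl:due:{client_id}", f"crawl:cooldown:{client_id}"],
                    args=[now_ms, rate_limit_ms])
```

After crawling, the worker would ZADD the URL back with a new due-time score so it comes around again on the next interval.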

Related

Fetch third party data in a periodic interval

I have an application with 10M users. The application has access to the users' Google Health data. I want to periodically read/refresh users' data using Google APIs.
The challenge I'm facing is the memory-intensive nature of this task. Since Google does not provide any callback for new data, I'll be doing a background sync (every 30 mins). All users would be picked and added to a queue, which would then be processed sequentially (depending upon the number of worker nodes).
Now for 10M users being refreshed every 30 mins, I need a lot of worker nodes.
Each user request takes around 1 sec including network calls.
In 30 mins, one node can process 1800 users.
To process 10M users, I need 10M / 1800 ≈ 5.5K nodes.
Quite expensive. Both monetary and operationally.
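In code, the estimate above is roughly:

```python
# The sizing math from the question, spelled out.
users = 10_000_000
seconds_per_user = 1            # one request takes ~1 s including network calls
window_s = 30 * 60              # refresh window of 30 minutes

users_per_node = window_s // seconds_per_user   # 1800 users per node per window
nodes_needed = users / users_per_node           # ~5556, i.e. "5.5K nodes"
print(users_per_node, round(nodes_needed))
```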
I then thought of using Lambdas. However, Lambda requires a NAT with an internet gateway to access the public internet. Relatively, it's very cheap.
I want to understand if there's any other possible solution with respect to this scale?
Without knowing more about your architecture and the Google APIs, it is difficult to make a recommendation.
Firstly, I would see if Google offers bulk export functionality, and then batch up the user requests. So instead of making one request per user you could make, say, one request for 100k users. This would reduce the overhead associated with connecting and with processing/parsing the message metadata.
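As a rough sketch of what that batching would look like (the bulk endpoint, payload shape, and 100k page size below are hypothetical; whether Google exposes such a bulk API at all is exactly what you would have to check first):

```python
# Sketch: batch user IDs into bulk requests instead of one call per user.
# BULK_ENDPOINT and the request payload are hypothetical placeholders.
import requests

BULK_ENDPOINT = "https://example.googleapis.com/v1/bulkFetch"   # hypothetical
BATCH_SIZE = 100_000

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def refresh_all(user_ids, token):
    for batch in chunks(user_ids, BATCH_SIZE):
        # 10M users / 100k per request = 100 requests per sync window,
        # instead of 10M individual requests.
        requests.post(BULK_ENDPOINT,
                      json={"userIds": batch},
                      headers={"Authorization": f"Bearer {token}"},
                      timeout=60)
```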
Secondly, I'd look to see if I could reduce the processing time; for example, an interpreted language like Python is in a lot of cases much slower than a compiled language like C# or Go. Or maybe a library or algorithm can be replaced with something more optimal.
Without more details of your specific setup it's hard to offer more specific advice.

ELB Balancing Stateful Servers

Let's say I have an HTTP/2 service that has a list of users and each user's hair color, both in memory and in a database.
Now I want to scale this up to multiple nodes; however, I do not want the same user to be in two different servers' memory: each server should handle its own specific set of users. This means I need to inform the load balancer where each user is being handled. In case of scaling down, I need to signal that a user is no longer assigned anywhere and can be routed to any server, or by a given rule, e.g. to the server with the least memory in use.
Would anyone know if the ALB load balancer supports that? One path I was considering is query-string-parameter-based routing, so I could include in the request itself something like destination_node = (int)user_id % 4 in case I had 4 nodes, for instance. This worked well in a proof of concept, but it leads to a few issues:
The service itself would need to know how many instances there are to balance.
I could not guarantee even balancing; it's basically luck-based balancing.
What would be the preferred approach for this, or what is a common way of solving this problem? Does AWS ELB support this out of the box? I was trying to avoid having to write my own balancer: a middleware that keeps track of which servers are handling which users and whose responsibility would be distributing the requests among those servers.
In the AWS Application Load Balancer (ALB) it is possible to write routing rules on:
Host Header
HTTP Header
HTTP Request Method
Path Pattern
Query String
Source IP
But at the moment there is no way to route under dynamic conditions.
If it is possible to group your data, I would prefer a path pattern like
/users/blond/123
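If you do stick with the user_id % N idea from the question, the routing hint can at least be computed in one small place (a thin client library or reverse proxy) so only that layer needs to know the node count. A sketch; the parameter name, hostname, and NUM_NODES are made up for illustration, and this is the question's own scheme rather than an ALB feature:

```python
# Sketch: client-side computation of the routing hint an ALB query-string
# rule could match on. NUM_NODES and the "shard" parameter are assumptions.
import hashlib

NUM_NODES = 4

def destination_node(user_id: int) -> int:
    # Hashing first spreads sequential IDs more evenly than raw user_id % N.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

def request_url(user_id: int) -> str:
    # One ALB rule per shard value forwards to the matching target group.
    return f"https://api.example.com/users/{user_id}?shard={destination_node(user_id)}"
```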

JMeter and page views

I'm trying to use data from Google Analytics for an existing website to load test a new website. In our busiest month, over one hour we had 8361 page requests. So should I get a list of all the URLs for these page requests and feed these to JMeter? Would that be a sensible approach? I'm hoping to compare the page response times against the existing website.
If you need to do this very quickly, say you have less than an hour for scripting, then doing it that way is enough to confirm there are no major differences between the two instances.
If you would like to go deeper:
8361 requests per hour == 2.3 requests per second so it doesn't make any sense to replicate this load pattern as I'm more than sure that your application will survive such an enormous load.
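Spelled out (the 1.5 s response time below is just an assumed example so the Little's-law estimate has something to chew on):

```python
# The load math from the Google Analytics figure.
requests_per_hour = 8361
throughput = requests_per_hour / 3600            # ~2.3 requests/second

# Little's law: concurrent users = throughput * time each user spends per request.
# The 1.5 s average response time is an assumed example value.
avg_response_time_s = 1.5
concurrent_users = throughput * avg_response_time_s   # ~3.5 users busy at any moment

print(f"{throughput:.2f} req/s, ~{concurrent_users:.1f} concurrent users")
```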
Performance testing is not only about hitting URLs from a list and measuring response times; normally the main questions which need to be answered are:
how many concurrent users my application can support while providing acceptable response times (at this point you may also be interested in requests/second)
what happens when the load exceeds the threshold, what types of errors start occurring, and what the impact is
does the application recover when the load gets back to normal
what is the bottleneck (i.e. lack of RAM, slow DB queries, low network bandwidth on the server/router, whatever)
So the options are:
If you need a "quick and dirty" solution you can use the list of URLs from Google Analytics with e.g. the CSV Data Set Config or the Access Log Sampler, or parse your application logs to replay production traffic with JMeter.
A better approach would be checking Google Analytics to identify which groups of users you have and their behavioral patterns, e.g. X% of unauthenticated users are browsing the site, Y% of authenticated users are searching, Z% of users are doing checkout, etc. After that you need to properly simulate all these groups using separate JMeter Thread Groups, keeping in mind cookies, headers, cache, think times, etc. Once you have this form of test, gradually and proportionally increase the number of virtual users and monitor the correlation between increasing response times and the number of virtual users until you hit any form of bottleneck.
The "sensible approach" would be to know the profile, the pattern of your load.
For that, it's excellent that you already have this data.
Yes, you can feed it in as-is, but that would be the quick & dirty approach, while getting the data analysed, distilling patterns out of it, and applying them to your test plan seems smarter.

What is going on inside of Nutch 2?

I am eager to know (and need to know) about Nutch and the algorithms it uses to fetch, classify, ... (generally, crawling), because it relates to my project.
I read this material but it's a little hard to understand.
Is there anyone who can explain this to me in a complete and easy-to-understand way?
Thanks in advance.
Short Answer
In short, they have developed a web crawler designed to very efficiently crawl the web from a many-computer environment (but which can also be run on a single computer).
You can start crawling the web without actually needing to know how they implemented it.
The page you reference describes how it is implemented.
Technology behind it
They make use of Hadoop, which is an open-source Java project designed along the same lines as MapReduce. MapReduce is the technology Google uses to crawl and organize the web.
I've attended a few lectures on MapReduce/Hadoop, and unfortunately, I don't know if anyone at this time can explain it in a complete and easy-to-understand way (they're kind of opposites).
Take a look at the wikipedia page for MapReduce.
The basic idea is to send a job to the Master Node; the Master breaks the work up into pieces and sends it (maps it) to the various Worker Nodes (other computers or threads), which perform their assigned sub-task and then send the sub-result back to the Master.
Once the Master Node gets all the sub-results (or some of the sub-results) it starts to combine them (reduce them) into the final answer.
All of these tasks are done at the same time, and each computer is given the right amount of work to keep it occupied the whole time.
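Not Nutch's actual code, but the flow described above can be sketched in a few lines of Python; in-process threads stand in for worker nodes, and counting pages per host stands in for a real job:

```python
# Toy illustration of the map -> shuffle -> reduce flow (not Nutch/Hadoop code).
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def map_fn(url):
    # Each "worker" emits (key, value) pairs; here: one count per host.
    return [(urlparse(url).netloc, 1)]

def reduce_fn(key, values):
    # Combine all values sharing a key into a single result.
    return key, sum(values)

def map_reduce(urls):
    # The "Master" fans the map work out to workers...
    with ThreadPoolExecutor(max_workers=4) as pool:
        mapped = [pair for pairs in pool.map(map_fn, urls) for pair in pairs]
    # ...groups the intermediate pairs by key (the shuffle/partition step)...
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # ...and reduces each group into the final answer.
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(map_reduce(["http://example.com/a", "http://example.com/b",
                  "http://example.org/"]))
```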
How to Crawl
A crawl consists of 4 jobs:
Generate
Fetch
Parse
Update Database
Generate
Start with a list of webpages containing the pages you want to start crawling from: The "Webtable".
The Master node sends all of the pages in that list to its slaves (but if two pages have the same domain they are sent to the same slave).
The Slave takes its assigned webpage(s) and:
Checks whether it has already been generated; if so, skips it.
Normalizes the URL, since "http://www.google.com/" and "http://www.google.com/../" are actually the same webpage.
Returns an initial score along with the webpage to the Master.
(the Master partitions the webpages when it sends it to its slaves so that they all finish at the same time)
The Master now chooses the topN (maybe the user just wanted to start with 10 initial pages), and marks them as chosen in the webtable.
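The normalization step in particular is easy to picture. A simplified stand-in (Nutch's real URLNormalizer plugins handle far more cases than this):

```python
# Simplified URL normalization so variants of the same page map to one key.
from urllib.parse import urlsplit, urlunsplit
import posixpath

def normalize(url: str) -> str:
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Lower-case scheme/host, resolve "..", drop the fragment, default path to "/".
    path = posixpath.normpath(path) if path else "/"
    return urlunsplit((scheme.lower(), netloc.lower(), path, query, ""))

print(normalize("http://www.Google.com/../"))   # -> http://www.google.com/
print(normalize("http://www.google.com/"))      # -> http://www.google.com/
```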
Fetch
Master looks at each URL in the webtable, maps the ones which were marked onto slaves to process them.
Slaves fetch each URL from the Internet as fast as the internet connection will let them; they keep a queue for each domain.
They return the URL along with the HTML text of the webpage to the Master.
Parse
Master looks at each webpage in the webtable, if it is marked as fetched, it sends it to its slaves to parse it.
The slave first checks to see if it was already parsed by a different slave; if so, it skips it.
Otherwise, it parses the webpage and saves the result to webtable.
Update Database
Master looks at each webpage in the webtable, sends the parsed rows to its slaves.
The slaves receive these parsed URLs, calculate a score for them based on the number of links away from those pages (and the text near those links), and send the URLs and scores back to the Master (where they arrive sorted by score because of the partitioning).
The master calculates and updates the webpage scores based on the number of links to those pages from other ones.
The master stores this all to the database.
Repeat
When the pages were parsed, the links out of those webpages were added into the webtable. You can now repeat this process on just the pages you haven't looked at yet to keep expanding your visited pages. Eventually you will reach most of the Internet after enough iterations of the four steps above.
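Strip Hadoop away and the four jobs collapse into a loop you could run on one machine. A heavily simplified sketch (a dict stands in for the webtable, link extraction is a crude regex, and politeness is just a fixed delay):

```python
# Heavily simplified single-machine version of generate/fetch/parse/update.
import re
import time
import urllib.request
from urllib.parse import urljoin, urlparse

webtable = {"https://example.com/": {"fetched": False, "score": 1.0}}

def generate(top_n=10):
    # Pick the highest-scoring pages that have not been fetched yet.
    todo = [u for u, row in webtable.items() if not row["fetched"]]
    return sorted(todo, key=lambda u: -webtable[u]["score"])[:top_n]

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse(base_url, html):
    # Crude link extraction; Nutch uses real parser plugins here.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def update(url, outlinks):
    webtable[url]["fetched"] = True
    for link in outlinks:
        if urlparse(link).scheme in ("http", "https"):
            webtable.setdefault(link, {"fetched": False, "score": 1.0})
            webtable[link]["score"] += 0.1   # toy stand-in for link-based scoring

for _ in range(3):                           # a few iterations of the cycle
    for url in generate():
        update(url, parse(url, fetch(url)))
        time.sleep(1)                        # crude politeness delay
```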
Conclusion
MapReduce is a cool system.
A lot of effort has been applied to make it as efficient as possible.
They can handle computers breaking down in the middle of a job and reassign the work to other slaves. They can handle some slaves being faster than others.
The Master may decide to do the slaves' tasks on its own machine instead of sending it out to a slave if it will be more efficient. The communication network is incredibly advanced.
MapReduce lets you write simple code:
Define a Mapper, an optional Partitioner, and a Reducer.
Then let MapReduce figure out how best to do that with all the computer resources it has access to, even if it is a single computer with a slow internet connection, or a kilo-cluster (maybe even mega-clusters).

Running web-fetches from within a Hadoop cluster

A blog post - http://petewarden.typepad.com/searchbrowser/2011/05/using-hadoop-with-external-api-calls.html - suggests calling external systems (querying the Twitter API, or crawling webpages) from within a Hadoop cluster.
For the system I'm currently developing, there are both fast and slow (bulk) sub-systems. Data is fetched from Twitter's API, also for quick, individual retrievals. This can be hundreds of thousands (even millions) of external requests per day. The content of web pages is also retrieved for further processing, with at least the same scale of requests.
Aside from potential side effects on the external source (changing data so it's different on the next request), what would be the pluses or minuses of using Hadoop in such a way? Is it a valid and useful method of bulk and/or fast retrieval of data?
The plus: it's a super easy way to distribute the work that needs to be done.
The minus: due to the way that Hadoop recovers from failures, you need to be very careful about managing what is and isn't run (which you can definitely do, it's just something to watch out for). If a reduce fails, for example, then all of the map jobs that feed that partition must also be rerun. Obviously this would most likely be a no-reducer job, but this is still true of mappers... what happens if half of the calls run, then the job fails and it is rescheduled?
You could use some sort of high-throughput system to manage the calls that are actually made or somesuch. But it definitely can be appropriately used for this.
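One cheap way to "manage what is and isn't run" when a task gets rescheduled is to make each external call idempotent by recording a claim in a shared store before the call. A sketch with Redis as that store (the key naming and TTL are made up; any shared store with an atomic set-if-absent works):

```python
# Sketch: guard against a re-run mapper repeating external calls.
import redis
import requests

r = redis.Redis()

def fetch_once(record_id: str, url: str, ttl_s: int = 24 * 3600):
    # SET NX only succeeds for the first attempt; a rescheduled task skips it.
    if not r.set(f"fetched:{record_id}", "1", nx=True, ex=ttl_s):
        return None                        # already done (or in flight)
    try:
        return requests.get(url, timeout=30).text
    except Exception:
        r.delete(f"fetched:{record_id}")   # release the claim so a retry can run
        raise
```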
