Architecture - How to efficiently crawl the web with 10,000 machines? - algorithm

Let’s pretend I have a network of 10,000 machines. I want to use all those machines to crawl the web as fast as possible. All pages should be downloaded only once. In addition, there must be no single point of failure and we must minimize the amount of communication required between machines. How would you accomplish this?
Is there anything more efficient than using consistent hashing to distribute the load across all machines and minimize communication between them?

Use a distributed MapReduce system like Hadoop to divide the workload.
If you want to be clever, or are doing this in an academic context, then try nonlinear dimensionality reduction.
The simplest implementation would probably be to use a hash function on the namespace key, e.g. the domain name or URL. Use a Chord ring to assign each machine a subset of the hash values to process.
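A minimal sketch of what that hash-based partitioning could look like (the Python, the per-host keying and the virtual-node count are illustrative assumptions, not part of the answer above):

import bisect
import hashlib

class ConsistentHashRing:
    """Chord-style ring: each machine owns several points on the ring,
    and a URL goes to the machine owning the next point clockwise."""

    def __init__(self, machines, vnodes=50):
        self._ring = sorted(
            (self._hash(f"{m}#{v}"), m)
            for m in machines
            for v in range(vnodes))

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(key.encode()).hexdigest()[:16], 16)

    def machine_for(self, url):
        # Key on the host so one site's pages stay on one machine,
        # which keeps per-site politeness/rate limiting local.
        host = url.split("/")[2] if "//" in url else url
        i = bisect.bisect(self._ring, (self._hash(host),)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing([f"crawler-{i:05d}" for i in range(10000)])
print(ring.machine_for("https://example.com/some/page"))

Because only the neighbouring ranges move when a machine joins or leaves, no central coordinator is needed, which helps with the no-single-point-of-failure requirement.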

One idea would be to use work queues (directories or a DB), assuming you will work out storage so that it meets your criteria for redundancy.
\retrieve
\retrieve\server1
\retrieve\server...
\retrieve\server10000
\in-process
\complete
1.) All pages used as seeds will be hashed and placed in the retrieve queue, using the hash as the file name root.
2.) Before putting a page in the queue, check the complete and in-process queues to make sure you don't re-queue it.
3.) Each server retrieves a random batch of 1-N files from the retrieve queue and attempts to move them to its private queue (see the sketch after this list).
4.) Files that fail the rename are assumed to have been “claimed” by another process.
5.) For files that could be moved and are to be processed, put a marker in the in-process directory to prevent re-queuing.
6.) Download the page and place the file into the \complete queue.
7.) Clean the file out of the in-process and server directories.
8.) Every 1,000 runs, check the oldest 10 in-process files by trying to move them from their server queues back into the general retrieve queue. This helps if a server hangs and should also load-balance slow servers.
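A rough sketch of steps 3-7, assuming a shared file system on which a rename between directories is atomic; the paths, batch size and the download stub are illustrative:

import os
import random

RETRIEVE = r"\retrieve"           # shared queue of hash-named URL files
IN_PROCESS = r"\in-process"
COMPLETE = r"\complete"
MY_QUEUE = r"\retrieve\server1"   # this server's private queue

def download(name):
    # Placeholder: fetch the URL recorded in the file and store the page.
    pass

def claim_batch(n=10):
    """Steps 3-4: grab up to n random files; a failed rename means another
    server claimed the file first, so it is simply skipped."""
    claimed = []
    candidates = [f for f in os.listdir(RETRIEVE)
                  if os.path.isfile(os.path.join(RETRIEVE, f))]
    for name in random.sample(candidates, min(n, len(candidates))):
        try:
            os.rename(os.path.join(RETRIEVE, name),
                      os.path.join(MY_QUEUE, name))    # atomic claim
            claimed.append(name)
        except OSError:
            pass                                       # lost the race
    return claimed

def process(name):
    """Steps 5-7: mark in-process, download, mark complete, clean up."""
    open(os.path.join(IN_PROCESS, name), "w").close()  # step 5 marker
    download(name)
    os.rename(os.path.join(MY_QUEUE, name),
              os.path.join(COMPLETE, name))            # step 6
    os.remove(os.path.join(IN_PROCESS, name))          # step 7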
For the retrieve, in-process and complete queues, most file systems hate millions of files in one directory. If you were scaling to billions of downloads, divide storage into segments based on the characters of the hash: \abc\def\123\ would be the directory for the file abcdef123FFFFFF….
If you are using MongoDB instead of a regular file store, many of these problems would be avoided, and you could benefit from its sharding, etc.
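As a hedged illustration of that last point, here is roughly what the same queue could look like in MongoDB (the connection string, collection and field names are all assumptions): a unique _id gives the de-duplication check for free, and an atomic update replaces the rename trick.

import hashlib
from pymongo import MongoClient, errors

client = MongoClient("mongodb://localhost:27017")   # illustrative connection string
queue = client.crawler.queue                        # illustrative db/collection names

def enqueue(url):
    """Keyed by the URL hash, so a duplicate-key error means the page is
    already queued, in process or complete and nothing is re-queued."""
    doc = {"_id": hashlib.sha1(url.encode()).hexdigest(),
           "url": url, "status": "queued"}
    try:
        queue.insert_one(doc)
        return True
    except errors.DuplicateKeyError:
        return False

def claim(server_id):
    """Atomically hand one queued URL to a server."""
    return queue.find_one_and_update(
        {"status": "queued"},
        {"$set": {"status": "in-process", "server": server_id}})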

Related

SphinxSearch - Different Nodes using shared data

We are in the process of building a SphinxSearch cluster using Amazon EC2 instances. We ran a sample test with several instances using the same shared file system (Elastic File System). Our idea is that a cluster might have more than 10 nodes, but we can use a single instance to index documents, keep the index on Elastic File System, and have it shared by multiple nodes for reading.
Our test worked fine, but is there technically any problem with this approach (like locking issues, etc.)?
Can someone please advise on this?
Thanks in advance.
If you're OK with having N copies of the index, you can do as follows (a rough sketch of the flow appears after these notes):
build an index in one place, in a temp folder
rename the files so they include .new.
distribute the index to all the other places using rsync or whatever you like. Some even do broadcasting with UFTP.
rotate the indexes at once in all the places by sending HUP to the searchd instances, or better by doing RELOAD INDEX (http://docs.manticoresearch.com/latest/html/sphinxql_reference/reload_index_syntax.html); it normally takes only a few ms, so we can say that your new index replaces the previous one simultaneously on all the nodes.
Previously (and perhaps still in Sphinx) there was an issue with rotating the index (either by --rotate or RELOAD) while it was processing a long query (the rotate just had to wait). This was fixed in Manticore Search recently.
This is a tried-and-true solution people have used in production for years. If you really want to share the same files among multiple searchd instances, you can softlink all the files except .spl, but then, to rotate the index in the searchd instances that use the links (rather than the actual files), you'll need to restart those searchd instances. That doesn't look good in general, but in some special cases it may still be a good solution.
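A rough sketch of the build/rsync/RELOAD flow described above, assuming SphinxQL is listening on the usual MySQL-protocol port 9306 and that rsync and pymysql are available; host names, paths and the index name are illustrative:

import subprocess
import pymysql

NODES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]    # illustrative node list
BUILD_DIR = "/tmp/index_build/"                    # freshly built index files
INDEX_DIR = "/var/lib/manticore/myindex/"          # illustrative index path

# Distribute the new files to every node (steps 2-3 above).
for node in NODES:
    subprocess.run(["rsync", "-a", BUILD_DIR, f"{node}:{INDEX_DIR}"], check=True)

# Rotate everywhere at once with RELOAD INDEX (step 4 above).
for node in NODES:
    conn = pymysql.connect(host=node, port=9306)   # SphinxQL speaks the MySQL protocol
    with conn.cursor() as cur:
        cur.execute("RELOAD INDEX myindex")
    conn.close()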

Does Hadoop use folders and subfolders

I have started learning Hadoop and just completed setting up a single node as demonstrated in the Hadoop 1.2.1 documentation.
Now I was wondering:
When files are stored in this type of FS, should I use a hierarchical mode of storage - like folders and sub-folders as I do in Windows - or are files just written in as long as they have a unique name?
Is it possible to add new nodes to the single-node setup if, say, somebody were to use it in a production environment? Or simply: can a single node be converted to a cluster without loss of data by just adding more nodes and editing the configuration?
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
When files are stored in this type of FS, should I use a hierarchical mode of storage - like folders and sub-folders as I do in Windows - or are files just written in as long as they have a unique name?
Yes, use the directories to your advantage. Generally, when you run jobs in Hadoop, if you pass along a path to a directory, it will process all files in that directory. So.. you really have to use them anyway.
Is it possible to add new nodes to the single-node setup if, say, somebody were to use it in a production environment? Or simply: can a single node be converted to a cluster without loss of data by just adding more nodes and editing the configuration?
You can add/remove nodes as you please (unless by single-node, you mean pseudo-distributed... that's different)
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
Lots
To expand on climbage's answer:
The maximum number of files is a function of the amount of memory available to your Name Node server. There is some loose guidance that each metadata entry in the Name Node requires somewhere between 150 and 200 bytes of memory (it varies by version).
From this you'll need to extrapolate out to the number of files and the number of blocks each file has (which can vary depending on file and block size), and then, for a given memory allocation (2 GB / 4 GB / 20 GB, etc.), you can estimate how many metadata entries (and therefore files) you can store.
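As a back-of-the-envelope illustration only (the per-entry cost and blocks-per-file figures below are the rough guidance above, not exact numbers):

# Rough Name Node capacity estimate: every file and every block
# costs roughly one in-memory metadata entry.
heap_bytes = 4 * 1024**3      # 4 GB Name Node heap (illustrative)
bytes_per_entry = 150         # rough per-entry cost; varies by version
blocks_per_file = 1.5         # assumed average blocks per file

entries = heap_bytes / bytes_per_entry
files = entries / (1 + blocks_per_file)
print(f"~{files / 1e6:.0f} million files for a 4 GB heap")
# -> on the order of ten million files: "lots", but not unlimited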

Hadoop Distributed Cache - modify file

I have a file in the distributed cache. The driver class, based on the output of a job, updates this file and starts a new job. The new job needs these updates.
The way I currently do it is to replace the old Distributed Cache file with a new one (the updated one).
Is there a way of broadcasting the diffs (between the old file and the new one) to all the task trackers which need the file?
Or is it the case that, after a job (the first one, in my case) is finished, all the directories/files specific to that job are deleted, and consequently it doesn't even make sense to think in this direction?
I think that the distributed cache is not built with such a scenario in mind. It simply puts files locally.
In your case I would suggest putting the file in HDFS and making all interested parties take it from there.
As an optimization you can give this file a high replication factor, and it will be local to most of the tasks.
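For example, something along these lines, assuming the standard hdfs command-line tools are on the PATH; the path and replication factor are illustrative:

import subprocess

SHARED_FILE = "/jobs/shared/lookup.dat"   # illustrative HDFS path

# Upload (overwrite) the updated file the next job will read...
subprocess.run(["hdfs", "dfs", "-put", "-f", "lookup.dat", SHARED_FILE], check=True)
# ...and raise its replication factor so most task nodes hold a local copy.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "10", SHARED_FILE], check=True)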

Flat or nested directory structure for an image cache?

My Mac app keeps a collection of objects (with Core Data), each of which has a cover image, and to which I assign a UUID upon creation. I had originally been storing the cover images as a field in my Core Data store, but recently started storing them on disk in the file system, instead.
Initially, I'm storing the covers in a flat directory, using the UUID to name the file, as below. This gives me O(1) fetching, as I know exactly where to look.
...
/.../Covers/3B723A52-C228-4C5F-A71C-3169EBA33677.jpg
/.../Covers/6BEC2FC4-B9DA-4E28-8A58-387BC6FF8E06.jpg
...
I've looked at the way other applications handle this task, though, and noticed a multi-level scheme, as below (for instance). This could still be implemented in O(1) time.
...
/.../Covers/A/B/3B723A52-C228-4C5F-A71C-3169EBA33677.jpg
/.../Covers/C/D/6BEC2FC4-B9DA-4E28-8A58-387BC6FF8E06.jpg
...
What might be the reason to do it this way? Does OS X limit the number of files in a directory? Is it in some way faster to retrieve them from disk? It would make the code used to calculate the file's name more complicated, so I want to find out if there is a good reason to do it that way.
On certain file systems (and I believe HFS+ too), having too many files in the same directory will cause performance issues.
I used to work at an ISP where they would break up the home directories (they had 90k+ of them) using a multi-directory scheme. You can partition your directories by using, say, the first two characters of the UUID, then the second two, e.g.:
/.../Covers/3B/72/3B723A52-C228-4C5F-A71C-3169EBA33677.jpg
/.../Covers/6B/EC/6BEC2FC4-B9DA-4E28-8A58-387BC6FF8E06.jpg
That way you don't need to calculate any extra characters or codes; just use the ones you already have to break it up. Since your UUIDs will be different every time, this should suffice.
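A minimal sketch of that scheme (the root directory and helper names are illustrative, and the same idea translates directly to Objective-C or Swift):

import os
import uuid

COVERS_ROOT = "Covers"   # illustrative root directory

def cover_path(cover_uuid):
    """3B723A52-C228-... -> Covers/3B/72/3B723A52-C228-....jpg"""
    name = cover_uuid.upper()
    return os.path.join(COVERS_ROOT, name[:2], name[2:4], name + ".jpg")

def save_cover(data):
    cover_uuid = str(uuid.uuid4())
    path = cover_path(cover_uuid)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path

Lookup stays O(1): the path is computed from the UUID alone, so no directory scan is ever needed.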
The main reason is that in the latter scheme, as you've mentioned, disk retrieval is faster because your directory is smaller (so the FS looks a file up in a smaller table).
As others mentioned, on some file systems it takes longer for the OS to open the file, because one directory with many files takes longer to read than a couple of small directories.
However, you should perform measurements on your particular file system and for your particular usage scenario. I did this for NTFS on Windows XP and was surprised to discover that the flat directory performed better in all kinds of tests than the hierarchical structure.

Millions of small graphics files and how to overcome slow file system access on XP

I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files into a single folder at a time; in some cases I need to create about 4.2 million tiles. I'm running it on Windows XP using an NTFS filesystem; the disk is 500 GB and was formatted using the default operating system options.
I'm finding the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or using the command line, the whole machine effectively locks up for a number of minutes before it recovers enough to do something again.
I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering the thing or whether I should look at using another file system altogether. Does anyone have any ideas how I might be able to overcome this issue?
Thanks,
Barry.
Update:
Thanks to everyone for the suggestions. The eventual solution involved writing a piece of code which monitored the GMapCreator output folder, moving files into a directory hierarchy based upon their filenames; so a file named abcdefg.gif would be moved into \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful - as noted below, I was amazed how much of a difference this made. Cheers :-)
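The actual mover code wasn't posted; a minimal sketch of that kind of watcher might look like this (the paths, extension filter and polling interval are assumptions):

import os
import time

SRC = r"C:\tiles\output"    # GMapCreator output folder (illustrative)
DST = r"C:\tiles\sorted"    # hierarchical destination (illustrative)

def target_path(filename):
    """abcdefg.gif -> DST\a\b\c\d\e\f\g.gif"""
    stem, ext = os.path.splitext(filename)
    return os.path.join(DST, *stem[:-1], stem[-1] + ext)

while True:
    for name in os.listdir(SRC):
        if not name.lower().endswith(".gif"):
            continue
        dst = target_path(name)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        try:
            os.rename(os.path.join(SRC, name), dst)
        except OSError:
            pass    # the renderer may still be writing this tile; retry later
    time.sleep(5)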
There are several things you could/should do:
Disable automatic NTFS short file name generation (google it)
Or restrict file names to use the 8.3 pattern (e.g. i0000001.jpg, ...)
In any case try making the first six characters of the filename as unique/different as possible
If you use the same folder over and over (say adding files, removing files, re-adding files, ...)
Use contig to keep the index file of the directory as unfragmented as possible (check this for an explanation)
Especially when removing many files, consider using the folder remove trick to reduce the directory index file size
As already posted, consider splitting up the files into multiple directories.
e.g. instead of
directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg
use
directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
You could try an SSD....
http://www.crucial.com/promo/index.aspx?prog=ssd
Use more folders and limit the number of entries in any given folder. The time to enumerate the entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem; Linux has the same behavior, last time I checked.
Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
The solution is most likely to restrict the number of files per directory.
I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their name. e.g.
gbp97m.xls
was stored in
g/b/p97m.xls
This works fine provided your files are named appropriately (we had a spread of characters to work with). The resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduce each directory to hundreds of files and free up the disk bottleneck.
One solution is to implement haystacks. This is what Facebook does for photos, as the metadata overhead and random reads required to fetch a file are quite high and offer no value for a data store.
Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.
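A toy sketch of the needle/haystack idea (this is not Facebook's implementation; it only shows why one big append-only file plus an in-memory offset index removes the per-file metadata cost):

import os

class Haystack:
    """One append-only store file plus a {key: (offset, length)} index,
    so each read is a single seek+read instead of a directory lookup,
    an inode fetch and an open per image."""

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}              # key -> (offset, length)

    def put(self, key, data):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key):
        offset, length = self.index[key]
        self.f.seek(offset)
        return self.f.read(length)

store = Haystack("photos.haystack")
store.put("3B723A52", b"...jpeg bytes...")
print(len(store.get("3B723A52")))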

Resources