How does a search script like InoutScripts' Inout Spider attain scalability? [closed] - hadoop

I want to know more about how search engine scripts like InoutScripts' Inout Spider attain scalability.
Is it because of the technology they are using?
Do you think it is because they combine Hadoop and Hypertable?

Hadoop is an open-source software framework used for storing and large-scale processing of data sets on clusters of commodity hardware. It is an Apache top-level project built and used by a global community of contributors and users. Rather than relying on hardware to deliver high availability, Hadoop detects and handles failures at the application layer itself.
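That scale-out model is easiest to see with a toy job. Below is a minimal word-count sketch for Hadoop Streaming, which lets any executable act as mapper and reducer; the script name, the input/output paths, and the path to the streaming jar are assumptions that depend on your installation.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word count: the same script acts as mapper or
# reducer depending on its first command-line argument.
#
# Hypothetical invocation (paths vary per installation):
#   hadoop jar /path/to/hadoop-streaming.jar \
#     -input /data/pages -output /data/wordcount \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -file wordcount.py
import sys


def mapper():
    # Emit "<word>\t1" for every word on stdin; Hadoop shuffles by key.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))


def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Because each mapper and reducer is an independent task, the framework can rerun a failed task on another node, which is the application-level fault handling described above.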

Related

How to build a powerful crawler like Google's? [closed]

I want to build a crawler which can update hundreds of thousands of links in several minutes.
Are there any mature approaches to scheduling?
Is a distributed system needed?
What is the greatest barrier that limits performance?
Thanks.
For Python you could go with Frontera by Scrapinghub
https://github.com/scrapinghub/frontera
https://github.com/scrapinghub/frontera/blob/distributed/docs/source/topics/distributed-architecture.rst
They're the same guys that make Scrapy.
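To make the Scrapy route concrete, here is a minimal spider sketch; the seed URL and the CSS selector are assumptions, and Frontera would be wired in as the scheduling backend through Scrapy settings rather than in the spider code itself.

```python
import scrapy


class LinkRefreshSpider(scrapy.Spider):
    # Hypothetical spider that revisits pages and queues their outgoing links.
    name = "link_refresh"
    start_urls = ["https://example.com/"]  # assumed seed URL

    def parse(self, response):
        # Record that this URL was fetched so freshness can be tracked downstream.
        yield {"url": response.url, "status": response.status}
        # Queue every outgoing link; the scheduler (Frontera, if configured)
        # decides ordering, deduplication, and politeness.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

You can run it locally with scrapy runspider on the file above; at the scale discussed in the other answers, the frontier (scheduling, deduplication, politeness) is what you would hand off to Frontera's distributed components.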
There's also Apache Nutch, which is a much older project.
http://nutch.apache.org/
You would need a distributed crawler, but don't reinvent the wheel: use Apache Nutch. It was built exactly for that purpose, is mature and stable, and is used by a wide community for large-scale crawls.
The amount of processing and memory involved calls for distributed processing unless you are willing to compromise on speed. Remember, you would be dealing with billions of links and terabytes of text and images.

Maximum # of up-to-date Prolog implementations with minimum overhead [closed]

This is not directly a question about Prolog code, but rather about installing, administering, and updating Prolog implementations...
Ideally, what I want to have on my machine is:
As many different Prolog implementations as possible
At all times, the most current (development) versions should be available
I might also want to have 2 or more versions side by side (stable / development)
A minimum overhead for installation, administration, etc.
I want to be able to choose which Prolog implementation to use each time I start my machine.
What can I do? What have you tried in this respect? I run Linux. Thank you in advance!

How to monitor data or structs in memory in Go [closed]

I want to create a project or package that loads data (maybe 1 or 2 million items) into memory. I want to monitor this data and know whether it is actually resident in memory. In Java this can be done with JMX (Java Management Extensions), but in Go I do not know how to do it.
I want to do this in a production environment, not just a testing environment.
Any help would be appreciated.
You can use runtime.ReadMemStats (or syscall.Getrusage) to track memory usage. You can then either use a statsd client or send UDP messages directly to update Graphite (or whatever monitoring package you like).
You might also find this article Monitoring a Production Golang Server with Memstats helpful.

Standard way to share state machine between two languages? [closed]

Is there a standard way to share state machines (that is, share the machine and synchronize its state) between two languages? I'm using the state_machine gem on a server and I need to synchronize the machine with another server that will be written in another language. Is there a standard way of accomplishing this so that I can maximize compatibility despite not knowing the other language? At this point, I'm thinking I'm just going to build my own "protocol" out of REST requests and share the initial machine structure using serialization.
I would accept "there is no standard way" as an answer.
There is no standard way for doing that ...
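Since there is no standard, the home-grown approach the question describes usually comes down to serializing the machine definition plus its current state and exchanging events over HTTP. Here is a minimal sketch of what the receiving side might look like, in Python using only the standard library; the payload shape, state names, and port are assumptions for illustration, not an established protocol.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical serialized machine definition, as it might be shared once at
# startup: states, allowed transitions (event -> (from, to)), and current state.
MACHINE = {
    "states": ["idle", "running", "done"],
    "transitions": {"start": ("idle", "running"), "finish": ("running", "done")},
    "current": "idle",
}


class SyncHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Return the current state so peers can resynchronize at any time.
        self._reply(200, {"current": MACHINE["current"]})

    def do_POST(self):
        # A POST body like {"event": "start"} applies a transition if it is
        # legal from the current state; otherwise report a conflict.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        transition = MACHINE["transitions"].get(payload.get("event"))
        if transition and transition[0] == MACHINE["current"]:
            MACHINE["current"] = transition[1]
            self._reply(200, {"current": MACHINE["current"]})
        else:
            self._reply(409, {"error": "illegal transition",
                              "current": MACHINE["current"]})

    def _reply(self, code, body):
        data = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)


if __name__ == "__main__":
    HTTPServer(("", 8000), SyncHandler).serve_forever()
```

The state_machine side would then POST an event after each local transition and GET the current state to resynchronize after a disconnect.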

Design level patterns for highly available Linux applications [closed]

Given toolkits like Linux-HA, cluster layers on top like Corosync, file-system replicators like DRBD, and various other bits and pieces, developers have the components available to build highly available, robust systems.
High-availability architecture-level patterns are often fairly easy to describe, but I'm looking for the level(s) below that.
While each of these pieces seems to be fairly well documented, and some of them show how to use them in a robust application, they don't show examples of an end-to-end or multi-resource application.
So, what are the concrete steps, patterns, recipes, etc. that should be followed in order for developed code to play nice in an environment like this?
What books, web tutorials, etc. should I point a team to in order to refactor a working single-box custom TCP server (for example) so that it runs under cluster control, writes to shared file-system space, and, when it fails over, has a chance to recover and keep working?
