How to build a powerful crawler like Google's? [closed] - performance

I want to build a crawler which can update hundreds of thousands of links in several minutes.
Are there any mature ways to do the scheduling?
Is a distributed system needed?
What is the greatest barrier that limits the performance?
Thanks.

For Python you could go with Frontera by Scrapinghub
https://github.com/scrapinghub/frontera
https://github.com/scrapinghub/frontera/blob/distributed/docs/source/topics/distributed-architecture.rst
They're the same guys that make Scrapy.
There's also Apache Nutch, which is a much older project.
http://nutch.apache.org/
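If you go the Scrapy/Frontera route, the crawl itself is just a spider; the frontier and scheduling logic is what Frontera layers on top. Below is a minimal sketch of a plain Scrapy spider (the domain and seed URL are placeholders, not anything from the question); Frontera's distributed backend would then take over scheduling as described in its docs.

```python
# Minimal Scrapy spider sketch; allowed_domains and start_urls are placeholders.
import scrapy


class LinkSpider(scrapy.Spider):
    name = "link_spider"
    allowed_domains = ["example.com"]        # placeholder domain
    start_urls = ["https://example.com/"]    # placeholder seed URL

    def parse(self, response):
        # Record the fetched URL and its HTTP status.
        yield {"url": response.url, "status": response.status}
        # Follow outgoing links; Scrapy's scheduler (or Frontera, if configured
        # as the frontier backend) decides when each request is actually fetched.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

With plain Scrapy this runs via `scrapy runspider`; the distributed-architecture document linked above describes how Frontera splits the frontier across workers and a message bus.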

You would need a distributed crawler, but don't reinvent the wheel: use Apache Nutch. It was built exactly for that purpose, is mature and stable, and is used by a wide community to deal with large-scale crawls.

The amount of processing and memory required calls for distributed processing unless you are willing to compromise on speed. Remember, you'd be dealing with billions of links and terabytes of text and images.
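To put the original goal of "hundreds of thousands of links in several minutes" into numbers, here is a quick back-of-envelope sketch; the 500,000-URL, 5-minute, and 50 KB-per-page figures are assumptions for illustration, not values from the question.

```python
# Back-of-envelope throughput estimate; all inputs are illustrative assumptions.
urls = 500_000            # "hundreds of thousands" of links
minutes = 5               # "several minutes"
avg_page_kb = 50          # rough average HTML page size

requests_per_second = urls / (minutes * 60)
bandwidth_mb_per_second = requests_per_second * avg_page_kb / 1024

print(f"{requests_per_second:.0f} requests/s")         # ~1667 requests/s
print(f"{bandwidth_mb_per_second:.0f} MB/s inbound")    # ~81 MB/s of HTML alone
```

Rates in that range, before politeness delays, DNS lookups, and parsing are even considered, are exactly why the answers above point toward a distributed fetcher rather than a single machine.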

Related

Is it possible to do high-performance computing with Golang and CUDA? [closed]

I've googled for a while, and the only useful pieces of information I found are:
github.com/barnex/cuda5
mumax.github.io/
Unfortunately, the latest Arch Linux only provides a CUDA 7.5 package, so barnex's project may not be supported.
Arne Vansteenkiste recommends concurrency rather than pure Golang or Golang plus CUDA. What's more, someone else has expressed the same idea: "Wouldn't it be cool to start a goroutine on a GPU and communicate with it via channels?" I think both of these ideas are wonderful, since I would like to change the existing code as little as possible rather than refactoring the whole program. Is the idea possible, or is there documentation covering this topic in detail?
Update
It seems that there are two HPC bindings for Golang:
CUDA (< 6.0): github.com/barnex/cuda5
OpenCL: github.com/rainliu/gocl
Both of them are sparsely documented. Currently, all I have is Macro13's answer, which is very helpful but more about Java. So please point me to some detailed materials for Golang. Thanks!

Difference between Hadoop and Google Analytics [closed]

Hello guys,
I am new to Hadoop and everything around big data.
While researching social media data integration with big data, I found a lot about Hadoop.
But I know there is Google Analytics too, if I want to observe social media and get some statistics.
So why are so many companies using Hadoop instead of Google Analytics?
What is the difference between those two?
Thank you for your answer :)
I will try to answer this as well as possible, as it's a strange question :)
The reason I say it's strange is that the two are not really related, and trying to find a correlation to compare them is tricky.
GA - Typically used to track web behavior. Provides a nice UI and is typically digestible by non-technical people (marketing etc.) to find insights.
Hadoop - Hadoop at its core is a file system (think of a very large hard drive); it stores data in a distributed fashion (across n servers). Its claim to fame is map/reduce (a minimal sketch of the model follows below) and the plethora of applications like Hive or Pig for analyzing data sitting in Hadoop.
A better comparison to the products you mentioned would be something like:
Why would I use Google Analytics vs Comscore? (web analytics)
Why would I use Hadoop vs Postgres? (data storage and data analysis)
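To make the map/reduce point concrete, here is a minimal, single-process Python sketch of the programming model that Hadoop distributes across a cluster. Word count is the canonical example; the tiny in-memory corpus is purely illustrative.

```python
# A local, single-process illustration of the map/reduce model that Hadoop
# runs across many machines (word count, the canonical example).
from collections import defaultdict


def map_phase(lines):
    # Map: turn each input line into (key, value) pairs.
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_phase(pairs):
    # Shuffle + reduce: group values by key and aggregate them.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return counts


if __name__ == "__main__":
    corpus = ["hadoop stores data", "hadoop processes data with map reduce"]
    print(dict(reduce_phase(map_phase(corpus))))
    # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1, 'with': 1, 'map': 1, 'reduce': 1}
```

On a real cluster, Hadoop runs the map function on the nodes holding the data blocks and the reduce function after a shuffle phase; tools like Hive and Pig compile SQL-like or dataflow scripts down to jobs of this shape.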

How does a search script like InoutScripts' Inout Spider attain scalability? [closed]

I want to know more about how search engine scripts like InoutScripts' Inout Spider attain scalability.
Is it because of the technology they are using?
Do you think it is because of the technology of combining Hadoop and Hypertable?
Hadoop, an open-source software framework, is used for storing and large-scale processing of data sets on clusters of commodity hardware. It is an Apache top-level project built and used by a global community of contributors and users. Hadoop detects and handles failures at the application layer itself rather than relying on hardware to deliver high availability.

Project handler program [closed]

In our company we need a project handler so we decided to write our own.
We use CMake and Bazaar, and we still don't know whether to store the project information in XML format or in a database.
We are stuck at this point: we would like to use as few languages/tools as possible, but we cannot find a way to interface CMake with XML files or databases.
One idea could be Python (a rough sketch of that idea follows below), but it would be really annoying to use a new language just for an interface. We've seen that there's a Python-based framework (Waf), but we have already used CMake for all our projects and it would take a lot of time to convert everything.
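If the Python idea is pursued, the interface can stay tiny: one script reads the project metadata from XML and prints the requested value, and CMake captures it with `execute_process`. This is a rough sketch only; the `projects.xml` layout, the tag names, and the `read_project.py` name are hypothetical.

```python
# read_project.py: print a single field from a hypothetical projects.xml,
# so CMake can capture it via execute_process(... OUTPUT_VARIABLE ...).
import sys
import xml.etree.ElementTree as ET


def lookup(xml_path, project_name, field):
    root = ET.parse(xml_path).getroot()
    for project in root.findall("project"):
        if project.get("name") == project_name:
            return project.findtext(field, default="")
    return ""


if __name__ == "__main__":
    # Usage: python read_project.py projects.xml myproject version
    print(lookup(sys.argv[1], sys.argv[2], sys.argv[3]), end="")
```

On the CMake side, something like `execute_process(COMMAND python read_project.py projects.xml myproject version OUTPUT_VARIABLE PROJECT_VERSION)` brings the value into the build, so the second language is confined to this one small script.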
We work with Ubuntu and Windows.
Suggestions?
Thanks in advance.
Rather than making your own tool, use an off-the-shelf product like something from the Jira suite or BuildMaster. Many of these have great integration with most build software and don't require you to write and maintain your own stack just to manage projects.
Focus your developer time on solving your business problems, not on reinventing the wheel. Their time is MUCH more valuable than the cost of using a ready-made solution.

Buy or build a tool for data reporting? [closed]

We have been asked to provide a data reporting solution. The following are the requirements:
i. The client has a lot of data which is generated every day as an outcome of the tests they run. These tests are run at several sites, and the results get automatically backed up to a central server.
ii. They already have Perl scripts which post-process the data and generate Excel-based reports.
iii. They need a web-based interface for comparing those reports, and they need to mark and track issues which might be present in that data.
I am unsure whether we should build our own tool for this or go for an already existing tool (any suggestions?). Can you please provide supporting arguments for the decision that you would suggest?
You need to narrow down your requirements (what kind of data needs to be compared, and in which format?). Then check whether there is already software available (commercial or free) that fulfills your needs. Based on that, decide if it's better (i.e. cheaper) to implement the functionality yourself or to use the other software.
Don't reinvent the wheel.
There are quite a few tools out there that specialise in this sort of thing; my gut feeling is that you can find something ready-made that does what you need.
As a side note, that tool may also be a better solution for creating those Excel reports than the Perl scripts.
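For a rough sense of how small the "build" side of the comparison step can be, here is a hedged pandas sketch. It assumes the post-processed results can be exported as CSV; the file names, the `test_id` key, and the `value` column are hypothetical.

```python
# Hedged sketch: diff two test-result exports on a shared key column.
# File names and column names ("test_id", "value") are hypothetical.
import pandas as pd

run_a = pd.read_csv("site1_run_a.csv")   # columns: test_id, value, ...
run_b = pd.read_csv("site1_run_b.csv")

merged = run_a.merge(run_b, on="test_id", suffixes=("_a", "_b"))
merged["delta"] = merged["value_b"] - merged["value_a"]

# Flag tests whose value changed by more than 5% (arbitrary threshold).
issues = merged[(merged["delta"].abs() / merged["value_a"].abs()) > 0.05]
print(issues[["test_id", "value_a", "value_b", "delta"]])
```

Even so, the marking-and-tracking-of-issues requirement is where ready-made reporting or test-management tools tend to pay off, which supports the "don't reinvent the wheel" advice above.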
