Loading 30000 documents from wikipedia - hadoop

I have a Wikipedia URL and I want to load the content from that page and from other referenced pages, up to 30,000 documents, using the wiki API. I could loop through the URLs and do that, but that is not an efficient way of doing it. Is there any other way I can achieve this? I need this to populate my HDFS in Hadoop.

You can download the MediaWiki software and a database image, set up your own Wikipedia and access it locally. This is well described and should be a lot more efficient than requesting that number of pages over the net. See: http://www.igeek.co.za/2009/10/16/how-to-mirror-wikipedia/
There are also many other sources, including preprocessed pages. Which one suits you depends on what you plan to do with the content in the next step.

There are a few ways to go about this. Toolserver users have direct database query access to all the metadata, but not text. If that suits you, you might be able to ask one of them to run a query through the query service. This is a pretty straight-forward way to find out what pages are linked, etc. and build a map of page ids or revision ids.
Otherwise, take a look at database dumps which are great for bulk work but will take some processing on your end.
Finally, Wikipedia is used to tons of bots and API scrapes. It's not ideal, but if nothing else suits you, then run a timer that starts a new query once every second and you'll be done in about 8 hours (30,000 requests at one per second is roughly 8.3 hours).
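For illustration, here is a minimal sketch of that throttled approach, assuming the MediaWiki parse API; the output file and title list are placeholders:

import time
import requests

API = "https://en.wikipedia.org/w/api.php"
titles = ["Hadoop", "MapReduce"]  # placeholder: your ~30,000 page titles

with open("pages.tsv", "w", encoding="utf-8") as out:
    for title in titles:
        resp = requests.get(API, params={
            "action": "parse",      # fetch the wikitext of one page
            "page": title,
            "prop": "wikitext",
            "format": "json",
        })
        wikitext = resp.json()["parse"]["wikitext"]["*"]
        out.write(title + "\t" + wikitext.replace("\n", " ") + "\n")
        time.sleep(1)               # stay at roughly one request per second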

As Jeff and NilsB said, crawling Wikipedia to fill your HDFS is the wrong approach. The right thing to do is to download the whole wiki as a single dump file and load it into HDFS.
But if we abstract away from some details, your question turns into a more general one: how do you crawl sites specified by URL using Hadoop?
So the answer is: upload the file(s) containing the URLs to HDFS, write a mapper (which accepts a URL, downloads the page, and yields key=url, value=page body), and configure the job to use NLineInputFormat to control how many URLs each mapper processes.
By controlling that parameter you control the level of parallelism and, through it, the number of map slots used.
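As a rough sketch of that mapper idea (written here as a Hadoop Streaming script in Python rather than a Java Mapper; the NLineInputFormat setting is the same idea, and error handling is kept minimal):

#!/usr/bin/env python
# Hadoop Streaming mapper: reads one URL per input line, downloads the page,
# and emits key=url, value=page body (newlines stripped so each record stays
# on one line). Run the job with NLineInputFormat and set
# mapreduce.input.lineinputformat.linespermap=N to control how many URLs
# each mapper handles.
import sys
import urllib.request

for line in sys.stdin:
    url = line.strip()
    if not url:
        continue
    try:
        body = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "replace")
    except Exception:
        continue  # skip URLs that fail to download
    print(url + "\t" + body.replace("\n", " ").replace("\t", " "))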

Related

How to build a price comparison program that scrapes the prices of a product across several websites

I am trying to build a price comparison program for personal use (and for practice) that allows me to compare prices of the same item across different websites. I have just started using the Scrapy library and have played around with scraping websites. These are my steps whenever I scrape a new website:
1) Find the website's search URL, understand its pattern, and store it. For instance, Target's search URL is composed of a fixed part, url="https://www.target.com/s?searchTerm=", plus the search terms (URL-encoded).
2) Once I know the website's search URL, I send a SplashRequest using the Splash library. I do this because many pages are heavily loaded with JS.
3) Look up the HTML structure of the results page and determine the correct XPath expression to parse the prices. However, many websites format their results pages differently depending on the search terms or product category, which changes the page's HTML code. Therefore, I have to examine all the possible results-page formats and come up with an XPath that can account for all of them.
I find this process very inefficient, slow, and inaccurate. For instance, at step 3, even though I have the correct XPath, I am still unable to scrape all the prices on the page (sometimes I also get prices of items that are not present in the rendered HTML page), which I don't understand. Also, I don't know whether the websites can tell that my requests come from a bot and are therefore sending me faulty or incorrect HTML. Moreover, this process cannot be automated: for example, I have to repeat steps 1 and 2 for every new website. Therefore, I was wondering if there is a more efficient process, library, or approach that I could use to help me finish this program. I also heard something about using a website's API, although I don't quite understand how it works. This is my first time doing scraping and I don't know much about web technologies, so any help/advice is highly appreciated!
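For reference, a minimal sketch of the workflow described in steps 1-3 above, using Scrapy with scrapy-splash; the search URL pattern and the price XPath are placeholders, and a running Splash instance is assumed:

import scrapy
from scrapy_splash import SplashRequest  # requires scrapy-splash and a Splash server

class PriceSpider(scrapy.Spider):
    name = "prices"

    # Step 1: a fixed search URL pattern per site (placeholder values).
    search_urls = {
        "target": "https://www.target.com/s?searchTerm={query}",
    }

    def start_requests(self):
        query = "laptop"  # example search term
        for site, pattern in self.search_urls.items():
            # Step 2: render the JS-heavy results page through Splash.
            yield SplashRequest(
                pattern.format(query=query),
                callback=self.parse_results,
                args={"wait": 2},
                meta={"site": site},
            )

    def parse_results(self, response):
        # Step 3: a site-specific XPath for prices (placeholder expression).
        for price in response.xpath('//span[contains(@class, "price")]/text()').getall():
            yield {"site": response.meta["site"], "price": price.strip()}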
The most common problem with crawling is that, in general, everything to be scraped is determined syntactically, whereas conceptualizing the entities you are working with helps a lot; I am speaking from my own experience.
In a research project about scraping that I was involved in, we reached the conclusion that we needed a semantic tree. This tree contains nodes which represent the data that is important for your purpose, and a parent-child relation means that the parent encapsulates the child in the HTML, XML, or other hierarchical structure.
You will therefore need some kind of concept of how you want to represent the semantic tree and how it maps onto the site structures. If your search method allows you to use a logical OR, then you will be able to define the same semantic tree for multiple online sources.
On the other hand, if the owners of some sites are willing to allow you to scrape their data, you might ask them to define the semantic tree for you.
If a given website's structure changes, then with a semantic tree you will more often than not be able to adapt to the change by just changing the selectors of a few elements, as long as the semantic tree's node structure remains the same. If some owners are partners who allow scraping, you will be able to just download their semantic trees.
If a website provides an API, you can use that; read about REST APIs to do so. However, these APIs are probably not uniform across sites.
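To make the semantic-tree idea a bit more concrete, here is a small sketch (the field names and selectors are invented for illustration): the tree of concepts stays the same for every source, and only the per-site selectors change.

from lxml import html  # hypothetical example; all selectors below are made up

# One semantic tree: a "product" node encapsulates "name" and "price" nodes.
SEMANTIC_TREE = {"product": ["name", "price"]}

# Per-site mapping from semantic nodes to XPath selectors.
SITE_SELECTORS = {
    "site_a": {"product": '//div[@class="item"]',
               "name": './/h2/text()', "price": './/span[@class="price"]/text()'},
    "site_b": {"product": '//li[@data-role="product"]',
               "name": './/a/@title', "price": './/em[@class="cost"]/text()'},
}

def extract(page_source, site):
    sel = SITE_SELECTORS[site]
    items = []
    tree = html.fromstring(page_source)
    for product in tree.xpath(sel["product"]):    # parent node
        record = {}
        for child in SEMANTIC_TREE["product"]:    # child nodes
            values = product.xpath(sel[child])
            record[child] = values[0].strip() if values else None
        items.append(record)
    return items

If a site changes its layout, only its entry in SITE_SELECTORS needs updating, as long as the node structure of the semantic tree stays the same.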

crawl small homepage with metadata.transfer and N:M-relationships

hi folks,
We use StormCrawler with Elasticsearch to build an index of our homepage, which consists of "old pages" and "new pages".
My Question in short:
If two pages A(old),B(new) link to page X, how to pass metadata from B to X?
My Question in long:
We relaunched our homepage step by step, so at the moment we have PDF files which are reachable only via the old HTML pages, only via the new HTML pages, or via both.
For "order by" purposes we must mark all PDF files which are reachable via the new HTML pages.
So we insert "newHomepage=true" into seeds.txt and "metadata.transfer/-newHomepage" into "crawler-conf.yaml": fine :-)
But for the PDF files which are reachable from both old and new HTML pages, we now have a race condition: if our PDF file is "DISCOVERED" from an old page, this information (newHomepage=false) is in the status index and cannot be overridden.
(StatusUpdaterBolt does not override documents; IndexerBolt does override by default.)
To make things more complicated: in our case a URL (on an HTML page) to a PDF is redirected two times before the file is delivered.
So from my point of view we have two possibilities:
Start the crawler two times: first we index only our new pages (and all reachable PDF files), then we index our old pages.
--> problems with new pages which are changed after the crawler was started
Store "outbound_links" and use them to set "newHomepage" independently of the crawler
--> short periods with wrong metadata in the index
Any advice or other ideas?
Best regards
Karsten
Thanks for sharing your problem, and great to hear that you are using SC. This is an interesting and unusual use case.
Your analysis of the problem is correct. An intuitive approach would be to extend the default StatusUpdaterBolt so that it updates the metadata if a document already exists. You'd need to remove the part that checks whether the doc has a status of DISCOVERED.
This would slow things down, but since you are dealing with a single website, this should not have a massive impact.
You could push the logic even further by setting a new nextFetchDate if the document had been fetched so that it gets refetched and updated quicker in the doc index (as opposed to the status one).
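A conceptual sketch of the difference (this is not StormCrawler code; a plain dict stands in for the status index, just to contrast the two update policies):

status_index = {}  # stands in for the ES status index: url -> record

def default_update(url, metadata):
    # Default behaviour: a URL that is already known keeps its original metadata.
    if url not in status_index:
        status_index[url] = {"status": "DISCOVERED", "metadata": dict(metadata)}

def merging_update(url, metadata):
    # Customised behaviour: merge metadata, so a later discovery from a new page
    # can still set newHomepage=true on an already-known PDF.
    record = status_index.setdefault(url, {"status": "DISCOVERED", "metadata": {}})
    record["metadata"].update(metadata)

default_update("site/doc.pdf", {"newHomepage": "false"})  # discovered via old page
default_update("site/doc.pdf", {"newHomepage": "true"})   # via new page: ignored
print(status_index["site/doc.pdf"]["metadata"])           # {'newHomepage': 'false'}

status_index.clear()
merging_update("site/doc.pdf", {"newHomepage": "false"})
merging_update("site/doc.pdf", {"newHomepage": "true"})   # kept
print(status_index["site/doc.pdf"]["metadata"])           # {'newHomepage': 'true'}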

Hadoop for the Wikipedia pagecount dataset

I want to build a Hadoop job that basically takes the Wikipedia pagecount statistics as input and creates a list like
en-Articlename: en:count de:count fr:count
For that I need the different article names related to each language - i.e. Bruges (en, fr), Brügge (de) - which the MediaWiki API can provide per article (http://en.wikipedia.org/w/api.php?action=query&titles=Bruges&prop=langlinks&lllimit=500).
My question is how to find the right approach to solve this problem.
My sketched approach would be:
Process the pagecount file line by line (example line: 'de Brugge 2 48824')
Query the MediaWiki API and write something like 'en-Articlename: processed-language-key:count'
Aggregate all en-Articlename values into one line (maybe in a second job?)
Now it seems rather unhandy to query the MediaWiki API for every line, but currently I cannot get my head around a better solution.
Do you think the current approach is feasible, or can you think of a different one?
On a side note: the resulting job chain shall be used to do some time measuring on my (small) Hadoop cluster, so altering the task is still okay.
Edit:
Here is a quite similar discussion which I just found.
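For illustration, a minimal sketch of steps 1 and 2 of the approach above: parsing one pagecount line (field layout inferred from the example line: language, title, view count, bytes) and fetching the langlinks through the API query given above, with format=json added:

import requests

def parse_pagecount_line(line):
    # Example line: 'de Brugge 2 48824' -> language, article title, views, bytes
    lang, title, count, _bytes = line.split(" ")
    return lang, title, int(count)

def fetch_langlinks(title, lang="en"):
    # Same query as in the question, returned as JSON.
    resp = requests.get(f"https://{lang}.wikipedia.org/w/api.php", params={
        "action": "query", "titles": title,
        "prop": "langlinks", "lllimit": 500, "format": "json",
    })
    pages = resp.json()["query"]["pages"]
    links = next(iter(pages.values())).get("langlinks", [])
    return {ll["lang"]: ll["*"] for ll in links}  # e.g. {'de': 'Brügge', 'fr': 'Bruges'}

lang, title, count = parse_pagecount_line("de Brugge 2 48824")
print(lang, title, count, fetch_langlinks("Bruges"))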
I think it isn't a good idea to query the MediaWiki API during your batch processing, because of:
network latency (your processing will be considerably slowed down)
a single point of failure (if the API or your internet connection goes down, your calculation will be aborted)
an external dependency (it is hard to repeat the calculation and get the same result)
legal issues and the possibility of a ban
The possible solution to your problem is to download the whole Wikipedia dump. Each article contains links to that article in other languages in a predefined format, so you can easily write a map/reduce job that collects that information and builds a correspondence between the English article name and the rest.
Then you can use that correspondence in a map/reduce job processing the pagecount statistics. If you do that, you'll become independent of MediaWiki's API, speed up your data processing, and improve debugging.
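A rough sketch of such a mapper (Hadoop Streaming style; it assumes the interlanguage links appear in the article wikitext as [[de:Brügge]]-style links and that each input record is 'title<TAB>wikitext' on a single line):

#!/usr/bin/env python
# Streaming mapper: emits "english_title <TAB> lang:foreign_title" for every
# interlanguage link found in an English article's wikitext.
import re
import sys

# Matches [[de:Brügge]]-style interlanguage links.
LANGLINK = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)?):([^\]|]+)\]\]")

for line in sys.stdin:
    try:
        title, wikitext = line.rstrip("\n").split("\t", 1)
    except ValueError:
        continue
    for lang, foreign_title in LANGLINK.findall(wikitext):
        print(f"{title}\t{lang}:{foreign_title}")

A reducer can then group the emitted pairs by English title to build the correspondence table used by the second job.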

Google crawling indexing algorithms

I am looking for some documents on how Google crawls and indexes content. I have read many "light" papers and articles on what you need to do to improve your ranking and make sure your content is properly indexed, but I am looking for more advanced technical documents on how Google crawls and indexes content.
The things I would like to know more about:
What elements Google looks for when it crawls: page content, URL format, keywords, description, etc.
How the index is updated.
Basically, I am trying to understand why some pages are indexed but others are not, even if their formats are similar. Why do only 10% of my site's pages appear when I do a search on the entire domain, even though I can see in my server logs that Google crawled every single link?
The answers to both questions are closely guarded trade secrets, ostensibly to prevent gaming the system.
Also keep in mind that Google makes over 400 algorithmic changes per year, making it close to impossible for an outsider to stay accurate and up to date. Short of working for Google, you're likely not going to find an in-depth and accurate answer.
However, Matt Cutts, head of the web spam team, frequently provides the most accurate insights into how Google handles content, both on his blog and on the GoogleWebmasterHelp YouTube channel. It's worth going through his content to get a much better understanding of Google's methodology.
To give you a technical view of how a web crawler works, I suggest you take a deep look at the Apache Nutch project (nutch.apache.org).
A typical web crawler has the following components: a fetcher, a parser, an indexer, and a searcher. To put it briefly, a web crawler fetches all the URLs available on a website and creates segments where it stores up to 101 KB per page. Those pages are parsed, but common stop words such as "and", "or", and "the" are not stored; the remaining words are analyzed using Bayesian calculations in order to produce a rank.
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. These tasks are mainly performed by storing a list of occurrences of each search criterion, typically in the form of an inverted index backed by a hash table or a binary tree.
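As a tiny illustration of what an inverted index is, here is a sketch that maps each term to the set of documents containing it (real engines also store positions and frequencies and drop stop words):

from collections import defaultdict

documents = {
    1: "hadoop stores data on a distributed file system",
    2: "google crawls and indexes web content",
    3: "an inverted index maps terms to documents",
}

# Build the inverted index: term -> set of document ids containing the term.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

def search(term):
    return sorted(inverted_index.get(term.lower(), set()))

print(search("index"))   # [3]
print(search("hadoop"))  # [1]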
As Mark stated, Google's calculations are mainly trade secrets, but patents issued by Google could be a good start. PageRank (http://en.wikipedia.org/wiki/PageRank) mainly analyses backlinks and the weight that the websites pointing to your site carry in people's preferences. In my experience it is important to offer an XML sitemap listing all the web pages on your site. In that sitemap you can define the crawl frequency for each page. gsitecrawler.com/ is an interesting option for generating one.
Google Website Optimizer will give you the chance to see what Google is finding on your site. Logs are OK, but the robot may be running into problems, and the best way to find out is with Google's website optimizer, which displays the errors.
Finally, most of your concerns are things that SEO specialists live for. I suggest you check sites like seomoz.com and their tools; you will learn how to position your website better in organic search results.
Hope it helps! Sebastian.
"Yes" Google like fresh & unique content.
Use Google webmaster guideline "try this instead" H1 or H2 meta tag on your HTML programming under the head tag ....your keyword. Anchor have to must use your business related keywords in H1, H2, it can help your site search engine.
Also use for Rich snippets in this tag..!
It scans your web page very precisely and sensitively. Factors such as whether your JavaScript is embedded or in a separate file, whether you use frames in your design, or whether you use heavy graphics can reduce the ranking of your page. Keywords are obviously rank-affecting entities. Broken links also bring your website's ranking down.
Basically, you can refer to http://www.tutorialspoint.com/seo/ to go through all the important points about Google's crawler. This will take a maximum of 40 minutes.
MapReduce: Simplified Data Processing on Large Clusters
I analysed the latest algorithm and found that Google now gives more importance to CONTENT rather than LINKS.
So if your content is good enough, with the proper tags in place, Google will automatically index it for you. I would suggest using H1 through H6, all in a sensible manner.

How to generate sitemap on a highly dynamic website?

Should a highly dynamic website that is constantly generating new pages use a sitemap? If so, how does a site like stackoverflow.com go about regenerating a sitemap? It seems like it would be a drain on precious server resources if it was constantly regenerating a sitemap every time someone adds a question. Does it generate a new sitemap at set intervals (e.g. every four hours)? I'm very curious how large, dynamic websites make this work.
On Stack Overflow (and all Stack Exchange sites), a sitemap.xml file is created which contains a link to every question posted on the system. When a new question is posted, they simply append another entry to the end of the sitemap file. It isn't that resource-intensive to add to the end of the file, but the file is quite large.
That is the only way search engines like Google can effectively crawl the site.
Jeff Atwood talks about it in a blog post: The Importance of Sitemaps
This is from Google's webmaster help page on sitemaps:
Sitemaps are particularly helpful if:
Your site has dynamic content.
Your site has pages that aren't easily discovered by Googlebot during the crawl process - for example, pages featuring rich AJAX or Flash.
Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
There's no need to regenerate the Google sitemap XML each time a question is posted. It's far simpler just to have the XML file generated on-demand directly from the database (and a little caching).
To reduce load, the sitemap can be split into many sitemaps. Partitioning it by day/month would allow you to tell Google to retrieve today's sitemap frequently, but only fetch the sitemap from six months ago once in a while.
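A sketch of that idea: a sitemap index pointing at one sitemap per month, with <lastmod> values so crawlers can see that only the current month's file keeps changing (the URL layout is made up):

from datetime import date

def sitemap_index(months):
    """months: list of (year, month, last_modified) tuples, oldest first."""
    entries = []
    for year, month, last_modified in months:
        entries.append(
            "  <sitemap>\n"
            f"    <loc>https://example.com/sitemap-{year:04d}-{month:02d}.xml</loc>\n"
            f"    <lastmod>{last_modified.isoformat()}</lastmod>\n"
            "  </sitemap>"
        )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + "\n".join(entries) + "\n</sitemapindex>")

# Older months keep a fixed lastmod; only the current month's sitemap changes,
# so Google can refetch it frequently and leave the old ones alone.
print(sitemap_index([(2023, 11, date(2023, 11, 30)),
                     (2023, 12, date.today())]))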
I'd like to share my solution here just in case it helps someone as well.
It took me reading this question and many others to decide what to do.
My site structure.
Static pages
Home (Highly dynamic. Cached for 30 mins)
Artists, Albums, Songs, Playlists and Albums (Paginated List)
Legal (Static page with Terms etc)
...etc
Dynamic Pages
Artists, Albums, Songs, Playlists and Albums detail pages
My approach.
sitemap.xml: This URL generates a <sitemapindex /> with the first item being /sitemap-main.xml. The number of Artists, Albums, Songs, etc. is counted and divided by 1,000 (the number of URLs I want in each sitemap; the limit is 50,000). I round this number up.
So, for example, 1,900 songs = 1.9 = 2.
I generate and add the URLs /sitemap-songs-0.xml and /sitemap-songs-1.xml to the index. I repeat this for all the other items. Basically, I am paginating.
The output is returned uncached. I want this to always be fresh.
sitemap-main.xml: This lists all the static pages. You can actually use a static file for this as you will only need to update it once in a while.
sitemap-songs-0.xml, sitemap-albums-0.xml, etc: I use a single route for this in SlimPhp 2.
$app->get('/sitemap-:type-:page.xml', function ($type, $page) use ($app) {...
I use a simple switch statement to generate the relevant files. If the page has 1,000 items (the limit specified above), I cache the file for two weeks.
Otherwise, I only cache it for a few hours.
I hope this can help anyone else implement their own system.
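Framework aside, the counting and pagination logic described above (divide each count by the page size, round up, and emit /sitemap-<type>-<page>.xml entries) looks roughly like this sketch, with placeholder counts:

import math

PER_SITEMAP = 1000  # URLs per sitemap file (the spec allows up to 50,000)

counts = {"songs": 1900, "albums": 250, "artists": 3200}  # placeholder counts

def sitemap_index_entries(counts):
    entries = ["/sitemap-main.xml"]  # static pages first
    for item_type, total in counts.items():
        pages = math.ceil(total / PER_SITEMAP)  # e.g. 1,900 songs -> 1.9 -> 2 files
        entries.extend(f"/sitemap-{item_type}-{page}.xml" for page in range(pages))
    return entries

print(sitemap_index_entries(counts))
# ['/sitemap-main.xml', '/sitemap-songs-0.xml', '/sitemap-songs-1.xml',
#  '/sitemap-albums-0.xml', '/sitemap-artists-0.xml', ...]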
Even on something like StackOverflow, there is a certain amount of static organization; there are FAQs, tag pages, question pages, user pages, badge pages, etc.; I'd say in a very dynamic site, the best way to approach a sitemap would be to have a map of the categorizations; each node in the sitemap can point to a page of the dynamically generated data (a node for a question page, a node for a user page, etc.).
Of course, a sitemap may not even be appropriate for a given site; there's a certain amount of judgment call required there.
For a highly dynamic site, I wrote a cron job on my server which runs daily. It makes a REST call to my backend every day, generates a new sitemap from all the newly created content, and returns the sitemap as an XML file. This new sitemap overrides the previous one and keeps my website up to date with all the changes. Regenerating the sitemap for each newly added piece of dynamic content is not a good approach, I think.
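A sketch of that setup, with a placeholder backend endpoint and output path: a small script fetches the current list of URLs once a day and rewrites sitemap.xml, and cron triggers it.

# regenerate_sitemap.py - run once a day, e.g. via cron:
#   0 3 * * * /usr/bin/python3 /opt/site/regenerate_sitemap.py
import requests

BACKEND = "https://example.com/api/all-urls"  # placeholder endpoint returning a JSON list of URLs
OUTPUT = "/var/www/html/sitemap.xml"          # placeholder path; overwrites the old sitemap

urls = requests.get(BACKEND, timeout=60).json()
entries = "\n".join(f"  <url><loc>{u}</loc></url>" for u in urls)
xml = ('<?xml version="1.0" encoding="UTF-8"?>\n'
       '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
       f"{entries}\n</urlset>\n")

with open(OUTPUT, "w", encoding="utf-8") as f:
    f.write(xml)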
