crawl small homepage with metadata.transfer and N:M-relationships - elasticsearch

hi folks,
We use StormCrawler with Elasticsearch to build an index of our website, which consists of "old pages" and "new pages".
My Question in short:
If two pages A (old) and B (new) link to page X, how do I pass metadata from B to X?
My Question in long:
We relaunched our website step by step, so at the moment we have PDF files which are reachable only via the old HTML pages, only via the new HTML pages, or via both.
For "order by" purposes we must mark all PDF files which are reachable from the new HTML pages.
So we add "newHomepage=true" to seeds.txt and "newHomepage" to the "metadata.transfer" list in crawler-conf.yaml: fine :-)
But for the PDF files which are reachable from both old and new HTML pages, we now have a race condition: if a PDF file is first DISCOVERED from an old page, this information (newHomepage=false) lands in the status index and cannot be overridden
(StatusUpdaterBolt does not overwrite existing documents; IndexerBolt does overwrite by default).
To make things more complicated: in our case a URL (on an HTML page) pointing to a PDF is redirected twice before the file is delivered.
So from my point of view we have two possibilities:
Run the crawler twice: first index only the new pages (and all reachable PDF files), then index the old pages.
--> Problems with new pages which are changed after the crawl has started
Store the "outbound_links" and use them to set "newHomepage" independently of the crawler
--> Short periods with wrong metadata in the index
Any advice or other ideas?
Best regards
Karsten

Thanks for sharing your problem, and great to hear that you are using SC. This is an interesting and unusual use case.
Your analysis of the problem is correct. An intuitive approach would be to extend the default StatusUpdaterBolt so that it updates the metadata if a document already exists. You'd need to remove the part that does the check on whether the doc has a status of DISCOVERED.
This would slow things down, but since you are dealing with a single website, this should not have a massive impact.
You could push the logic even further by setting a new nextFetchDate if the document had been fetched so that it gets refetched and updated quicker in the doc index (as opposed to the status one).
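One hedged way to picture that change, outside of the bolt itself: instead of the create-style write that leaves an existing status document untouched (which is why StatusUpdaterBolt does not overwrite), issue a partial update with doc_as_upsert so the newHomepage flag is merged into the existing document. The sketch below talks to Elasticsearch directly via the high-level REST client; the index name, field layout and document id are assumptions, and in a real crawl this logic would live inside a custom subclass of the Elasticsearch StatusUpdaterBolt.

```java
// Hypothetical sketch (not the actual StormCrawler bolt): a partial update with
// doc_as_upsert either creates the status document or merges the given fields into
// it, so a later discovery from a "new" page can still flip newHomepage to true.
// Index name, field names and the document id are assumptions; Elasticsearch 7.x
// high-level REST client assumed.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class StatusMetadataMerger {

    private final RestHighLevelClient client;

    public StatusMetadataMerger(RestHighLevelClient client) {
        this.client = client;
    }

    /** Merge newHomepage into the status doc for this URL, creating the doc if missing. */
    public void markNewHomepage(String docId, String url, boolean newHomepage) throws IOException {
        Map<String, Object> doc = new HashMap<>();
        doc.put("url", url);
        doc.put("metadata", Map.of("newHomepage", newHomepage));

        UpdateRequest update = new UpdateRequest("status", docId)
                .doc(doc)
                .docAsUpsert(true); // create if absent, otherwise merge these fields in
        client.update(update, RequestOptions.DEFAULT);
    }
}
```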

Related

How to go down (recursively) in UIPath

I want to go down recursively and automatically through the Hosts and Clusters tab (the blue one in the picture), and then grab text like the guest OS, compatibility, etc.
I already know how to get the text from the summary, but I have a problem looping (going down) through the Hosts and Clusters tabs.
Any idea how to do it?
Thank you very much.
It's probably not the best idea to have one robot do everything - i.e. setting filters, parsing results, clicking on each result, parsing data, then going back, and so on. Instead of that approach, divide and conquer. Create multiple sequences/workflows, each one with one specific task in mind.
Here's how I would tackle it:
Create a sequence that opens the target page, setting the filters (e.g. Germany, as mentioned in your link: https://www.autoscout24.de/ergebnisse?cy=D&powertype=kw&atype=C&ustate=N%2CU&sort=standard&desc=0&page=1&size=20)
Have the same robot extract the link for each result. Don't go into the details; that could potentially be the task of another robot.
Have the links stored in a queue. Read more about Queues and Transactions in the UiPath documentation.
Create another sequence that takes the next item in the queue and opens the stored link. This will take you directly to the results page (e.g. https://www.autoscout24.de/angebote/opel-corsa-12v-city-servo-ahk-benzin-silber-84a66193-b6d6-4fc9-b222-ec5a7319f221?cldtidx=1)
Parse any kind of data you need.
This comes with the benefit that multiple robots can extract data at the same time, potentially speeding up scraping significantly. The Queues and Transactions feature will make sure each result is visited only once and that multiple robots don't process the same item.
Edit: you might want to start with ReFramework, which I'd recommend.
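UiPath's Queues and Transactions live in Orchestrator rather than in code, but the dispatcher/performer split is easy to picture with a plain in-memory queue. The Java sketch below is only an analogy for the pattern (all names and URLs are made up); it is not UiPath API code.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class DispatcherPerformerDemo {

    public static void main(String[] args) throws InterruptedException {
        // Dispatcher: one workflow collects the result links and enqueues them once.
        Queue<String> queue = new ConcurrentLinkedQueue<>(List.of(
                "https://www.autoscout24.de/angebote/example-1",
                "https://www.autoscout24.de/angebote/example-2",
                "https://www.autoscout24.de/angebote/example-3"));

        // Performers: several robots poll the queue; poll() hands each link to exactly
        // one worker, which mirrors what Orchestrator transactions guarantee.
        Runnable performer = () -> {
            String link;
            while ((link = queue.poll()) != null) {
                System.out.println(Thread.currentThread().getName() + " parses " + link);
            }
        };
        Thread robot1 = new Thread(performer, "robot-1");
        Thread robot2 = new Thread(performer, "robot-2");
        robot1.start();
        robot2.start();
        robot1.join();
        robot2.join();
    }
}
```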

Loading 30000 documents from wikipedia

I have a Wikipedia URL and I want to load the content from that page and other referenced pages, up to 30,000 documents, using the wiki API. I can loop through the URLs and do that, but that is not an efficient way of doing it. Is there any other way I can achieve this? I need this to populate my HDFS in Hadoop.
You can download the MediaWiki software and a database image, set up the wiki and access it locally. This is well described and should be a lot more efficient than requesting that number of pages over the net. See: http://www.igeek.co.za/2009/10/16/how-to-mirror-wikipedia/
There are also many other sources, as well as preprocessed pages. Which raises the question: what do you plan to do with the content in the next step?
There are a few ways to go about this. Toolserver users have direct database query access to all the metadata, but not the text. If that suits you, you might be able to ask one of them to run a query through the query service. This is a pretty straightforward way to find out what pages are linked, etc., and to build a map of page IDs or revision IDs.
Otherwise, take a look at database dumps which are great for bulk work but will take some processing on your end.
Finally, Wikipedia is used to tons of bots and API scrapes. It's not ideal, but if nothing else suits you then run a timer that starts a new query once every second and you'll be done in 8 hours.
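If you do go that route, the timer loop is trivial. This hedged sketch assumes a plain text file of page titles and the Wikimedia REST endpoint for page HTML (both are assumptions; adapt it to whichever API call you actually need). At one request per second, ~30,000 pages is indeed roughly 8 hours.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SlowWikiFetcher {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Files.createDirectories(Path.of("pages"));
        // titles.txt: one page title per line, already in underscore form (e.g. Albert_Einstein)
        List<String> titles = Files.readAllLines(Path.of("titles.txt"));
        for (String title : titles) {
            String url = "https://en.wikipedia.org/api/rest_v1/page/html/"
                    + URLEncoder.encode(title, StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("User-Agent", "bulk-loader-example")
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            Files.writeString(Path.of("pages", title.replace('/', '_') + ".html"),
                    response.body());
            Thread.sleep(1000); // stay at roughly one request per second
        }
    }
}
```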
As Jeff and NilsB said, crawling Wikipedia to fill your HDFS is the wrong approach. The right thing to do is to download the whole wiki as a single dump file and load it into HDFS.
But if we abstract away some details of your question, it becomes a more general one: how do you crawl a set of sites, specified by URL, using Hadoop?
The answer is that you should upload the file(s) containing the URLs to HDFS, write a mapper (which accepts a URL, downloads the page, and emits key=url, value=page body), and configure the job to use NLineInputFormat to control how many URLs each mapper processes.
By tuning that parameter you control the level of parallelism, together with the number of map slots.
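To make that concrete, here is a hedged sketch of such a map-only job. Class names, paths and the split size are made up, and a recent JDK with java.net.http is assumed for the download step.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlFetchJob {

    // Map-only job: each input line is a URL, each output record is (url, page body).
    public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final HttpClient http = HttpClient.newHttpClient();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String url = line.toString().trim();
            if (url.isEmpty()) {
                return;
            }
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            HttpResponse<String> response =
                    http.send(request, HttpResponse.BodyHandlers.ofString());
            context.write(new Text(url), new Text(response.body()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fetch-urls");
        job.setJarByClass(UrlFetchJob.class);
        job.setMapperClass(FetchMapper.class);
        job.setNumReduceTasks(0);               // no reducer needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100); // 100 URLs per mapper: tune for parallelism
        FileInputFormat.addInputPath(job, new Path(args[0]));   // file(s) of URLs on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // where the fetched pages go

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```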

Key based caching

I'm reading this article:
http://37signals.com/svn/posts/3113-how-key-based-cache-expiration-works
I'm not using Rails, so I don't really understand their example.
It says in #3:
When the key changes, you simply write the new content to this new key. So if you update the todo, the key changes from todos/5-20110218104500 to todos/5-20110218105545, and thus the new content is written based on the updated object.
How does the view know to read from the new todos/5-20110218105545 instead of the old one?
I was confused about that too at first -- how does this save a trip to the database if you have to read from the database anyway to see if the cache is valid? However, see Jesse's comments (1, 2) from Feb 12th:
How do you know what the cache key is? You would have to fetch it from the database to know the mtime right? If you’re pulling the record from the database already, I would expect that to be the greatest hit, no?
Am I missing something?
and then
Please remove my brain-dead comment. I just realized why this doesn’t matter: the caching is cascaded, so yes a full depth regeneration incurs a DB hit. The next cache hit will incur one DB query for the top-level object—all the descendant objects are not queried because the cache for the parent object includes cached versions for the children (thus, no query necessary).
And Paul Leader's comment 2 below that:
Bingo. That’s why it works soooo well. If you do it right it doesn’t just eliminate the need to generate the HTML but any need to hit the db. With this caching system in place, our data-vis app is almost instantaneous, it’s actually usable and the code is much nicer.
So, given the models that DHH lists in step 5 of the article and the views he lists in step 6, given that you've properly set up your relationships to touch the parent objects on update, and given that your partials access your child data as parent.children (or even child.children in nested partials), this caching system should be a net gain: as long as the parent's cache key is still valid, the parent.children lookup never happens and the children are pulled from the cache as well.
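To make the key mechanics concrete, here is a small sketch in plain Java, not Rails, with made-up names: the key is recomputed from the object's id and updated-at stamp on every read, so there is nothing to "know" about the new key, and stale entries are simply never read again and eventually get evicted.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyBasedCacheDemo {

    // Stand-in for memcached/Redis: old keys are never invalidated, they simply
    // stop being read and are eventually evicted by the store.
    static final Map<String, String> CACHE = new HashMap<>();

    // A "todo" with an id and an updated-at stamp (e.g. yyyyMMddHHmmss, as in the article).
    record Todo(long id, String title, long updatedAt) {
        String cacheKey() {
            return "todos/" + id + "-" + updatedAt; // the key changes on every update
        }
    }

    static String renderWithCache(Todo todo) {
        // The key is recomputed from the object itself, so the "new" key is always known.
        return CACHE.computeIfAbsent(todo.cacheKey(),
                key -> "<li>" + todo.title() + "</li>"); // pretend this render is expensive
    }

    public static void main(String[] args) {
        Todo v1 = new Todo(5, "Buy milk", 20110218104500L);
        System.out.println(renderWithCache(v1)); // miss: rendered and stored under v1's key

        // Updating the todo bumps updated_at, which yields a brand-new key.
        Todo v2 = new Todo(5, "Buy oat milk", 20110218105545L);
        System.out.println(renderWithCache(v2)); // miss on the new key; the old entry is ignored
        System.out.println(renderWithCache(v2)); // hit: no re-render, no extra lookup
    }
}
```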
However, this method may be pointless if your partials reference lots of instance variables from the controller since those queries will already have been performed by the time Rails sees the calls to cache in the view templates. In that case you would probably be better off using other caching patterns.
Or at least this is my understanding of how it works. HTH

How to generate sitemap on a highly dynamic website?

Should a highly dynamic website that is constantly generating new pages use a sitemap? If so, how does a site like stackoverflow.com go about regenerating a sitemap? It seems like it would be a drain on precious server resources if it was constantly regenerating a sitemap every time someone adds a question. Does it generate a new sitemap at set intervals (e.g. every four hours)? I'm very curious how large, dynamic websites make this work.
On Stackoverflow (and all Stack Exchange sites), a sitemap.xml file is created which contains a link to every question posted on the system. When a new question is posted, they simply append another entry to the end of the sitemap file. It isn't that resource intensive to add to the end of the file but the file is quite large.
That is the only way search engines like Google can effectively crawl the site.
Jeff Atwood talks about it in a blog post: The Importance of Sitemaps
This is from Google's webmaster help page on sitemaps:
Sitemaps are particularly helpful if:
Your site has dynamic content.
Your site has pages that aren't easily discovered by Googlebot during the crawl process - for example, pages featuring rich AJAX or Flash.
Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
There's no need to regenerate the Google sitemap XML each time a question is posted. It's far simpler just to have the XML file generated on-demand directly from the database (and a little caching).
To reduce load, the sitemap can be split into many sitemaps. Partitioning it by day/month would allow you to tell Google to retrieve today's sitemap frequently, but only fetch the sitemap from six months ago once in a while.
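As a rough illustration of that split (the base URL and month range are made up), the index itself is tiny and cheap to build on demand; each entry just points at one month's worth of URLs served by a separate endpoint.

```java
import java.time.YearMonth;

public class SitemapIndexBuilder {

    // Builds a <sitemapindex> with one child sitemap per month; the monthly files
    // themselves would be generated from the database by another endpoint.
    public static String build(String baseUrl, YearMonth first, YearMonth last) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (YearMonth month = first; !month.isAfter(last); month = month.plusMonths(1)) {
            xml.append("  <sitemap>\n")
               .append("    <loc>").append(baseUrl).append("/sitemap-").append(month).append(".xml</loc>\n")
               .append("  </sitemap>\n");
        }
        xml.append("</sitemapindex>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Crawlers can refetch the current month's sitemap often and the old ones rarely.
        System.out.println(build("https://example.com", YearMonth.of(2011, 1), YearMonth.now()));
    }
}
```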
I'd like to share my solution here just in case it helps someone as well.
It took me reading this question and many others to decide what to do.
My site structure.
Static pages
Home (Highly dynamic. Cached for 30 mins)
Artists, Albums, Songs and Playlists (paginated lists)
Legal (Static page with Terms etc)
...etc
Dynamic Pages
Artists, Albums, Songs and Playlists detail pages
My approach.
sitemap.xml: This URL generates a <sitemapindex /> with the first item being /sitemap-main.xml. The number of Artists, Albums, Songs, etc. is counted and divided by 1,000 (the number of URLs I want in each sitemap; the limit is 50,000). I round this number up.
So, for example, 1,900 songs = 1.9 = 2 sitemaps.
I then add the URLs /sitemap-songs-0.xml and /sitemap-songs-1.xml to the index, and repeat this for all the other item types. Basically, I am paginating.
The output is returned uncached. I want this to always be fresh.
sitemap-main.xml: This lists all the static pages. You can actually use a static file for this as you will only need to update it once in a while.
sitemap-songs-0.xml, sitemap-albums-0.xml, etc: I use a single route for this in SlimPhp 2.
$app->get('/sitemap-:type-:page.xml', function ($type, $page) use ($app) {...
I use a simple switch statement to generate the relevant files. If a page contains 1,000 items (the limit specified above), I cache the file for two weeks.
Otherwise, I only cache it for a few hours.
I guess this can help anyone else implement their own system.
Even on something like StackOverflow, there is a certain amount of static organization: there are FAQs, tag pages, question pages, user pages, badge pages, etc. I'd say that on a very dynamic site, the best way to approach a sitemap is to have a map of the categorizations; each node in the sitemap can point to a page of the dynamically generated data (a node for a question page, a node for a user page, etc.).
Of course, a sitemap may not even be appropriate for a given site; there's a certain amount of judgment call required there.
For a highly dynamic site, I wrote a cron job on my server which runs on a daily basis. It makes a REST call to my backend every day, generates a new sitemap covering all the newly generated content, and returns the sitemap as an XML file. This new sitemap overrides the previous one and keeps my website up to date with all the changes. Regenerating the sitemap for every newly added piece of dynamic content is not a good approach, I think.

Method to detect a parked page?

Does anyone know of a way to programmatically detect a parked web page? That is, those pages that you accidentally type in (or intentionally sometimes) and that are hosted by a domain-parking service with nothing but ads on them.
I am working on a linking network and want to make sure that sites that expire don't end up getting snatched by someone else and turned into parked pages.
Here is a test that I think may catch a decent number of them. It takes advantage of the fact that you don't actually want to have real web sites up for your parked domains. It looks for wildcarding of both the subdomain and the path. Let's say we have this URL in our system:
http://www.example.com/method-to-detect-parked
First, I would fetch the actual URL and hash it, or grab a copy, for comparison.
My second check would be against
http://random.example.com/random
If it matches the original link, or even just succeeds, you have a pretty good indicator that the page is parked. If it fails, I might check both the subdomain and the path individually. If the page randomly changes some elements, you may want to choose a few items to compare: for example, make a list of the links included in the page and compare those, or maybe the title tag.
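Here is a hedged sketch of that wildcard test (the domain, path and class names are illustrative only, and this is an indicator, not a definitive detector):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.UUID;

public class ParkedPageCheck {

    private static final HttpClient HTTP = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    private static byte[] fetchDigest(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        HttpResponse<byte[]> response =
                HTTP.send(request, HttpResponse.BodyHandlers.ofByteArray());
        return MessageDigest.getInstance("SHA-256").digest(response.body());
    }

    /** True if a random subdomain + random path serves the same content as the real URL. */
    public static boolean looksParked(String domain, String path) {
        String random = UUID.randomUUID().toString().substring(0, 8);
        try {
            byte[] original = fetchDigest("http://www." + domain + path);
            byte[] probe = fetchDigest("http://" + random + "." + domain + "/" + random);
            // The probe even resolving is already suspicious; identical content more so.
            return Arrays.equals(original, probe);
        } catch (Exception e) {
            return false; // probe did not resolve or fetch: no evidence of wildcarding
        }
    }

    public static void main(String[] args) {
        System.out.println(looksParked("example.com", "/method-to-detect-parked"));
    }
}
```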
I would say that you'll have to examine the WHOIS records for the sites in question and/or the actual content of the pages and develop some heuristics as to what constitutes a "parked page".
Take goooogle.com: looking at its WHOIS record shows that it is owned by "Privacy Protection" and that its DNS servers are ns1/ns2.fastpark.net. If you look at the source of the site, they're silly enough to have a CSS file named "style_park.css" :)
All in all, I don't think you'll be able to come up with a generic way to do it. You'll probably end up with an ever-evolving rule base or blacklist.
Look at the creation date of the DNS/WHOIS record and compare it to the date the link was added. If the DNS record is newer, that's a link that needs manual checking.
Or: check http://example.com/ and http://example.com/xxxxxxrandomstringxxxxx . If those two pages are identical, you've got some sort of problem that needs manual checking. Either the primary page you wanted to link to is broken, or the domain is parked and all pages return the same value. This test is not 100%, because some parked pages echo back elements from the URL.
If you just want to check an existing website, a service like http://www.linkalarm.com/ does this well.
You could just rely on your users to "Report this link"... which would put it into a queue to review later?
