Can't scrape particular websites using Scrapinghub

I am using the autoscraping feature in the Scrapinghub service.
While building and deploying the autoscraper, I found that the site I wanted to scrape never returned any requests and timed out after around 3.5 minutes.
So I began reading the documentation to see if I could figure out why this was happening (How to check if site is suitable for autoscraping).
I followed the steps, temporarily disabled JavaScript in my browser (Chrome), and found that I had no problem viewing the site I wanted to scrape.
My question is, at the risk of sounding vague: what might be some other reasons a site is not scrapeable, aside from JavaScript? Are there other ways to diagnose a problem like this?
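One way to narrow this down outside the Scrapinghub UI is to fetch the page with a plain HTTP client and compare what comes back with what your browser shows. Below is a minimal sketch using the Python requests library; the URL and header values are placeholders, not anything specific to the site in question.

```python
import requests

# Placeholder URL; substitute the site you are trying to autoscrape.
URL = "https://example.com/"

# A browser-like User-Agent; some sites serve different content
# (or nothing at all) to clients that identify themselves as bots.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}

try:
    response = requests.get(URL, headers=HEADERS, timeout=30)
    print("Status code:", response.status_code)
    print("Content type:", response.headers.get("Content-Type"))
    print("Body length:", len(response.text))
    # If the body is tiny or looks like a block page, the site may be filtering
    # by User-Agent, IP range, or cookies rather than relying on JavaScript.
    print(response.text[:500])
except requests.exceptions.Timeout:
    print("Request timed out: the site may be throttling or silently dropping bot traffic.")
```

If a plain request like this succeeds but the autoscraper still stalls, common culprits are rate limiting, IP-based blocking of cloud-provider address ranges, mandatory cookies or sessions, and restrictive robots.txt rules.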

Related

How to apply PageSpeed Insights results

I am trying to set up a web page with WordPress and GoDaddy hosting. I want to make the page fast, because people say fast pages rank higher on Google (and mobile page speed in particular is said to be very important). So I want a very fast page, but my level of knowledge is not very advanced; I make progress by learning.
When I test my page with PageSpeed Insights, my mobile score is about 60-70. The Insights report lists lots of improvement suggestions below the score, and I want to learn how to fix them. If you help me with one example, I will do the others myself.
Let's start with the first problem, /css?family=…(fonts.googleapis.com), shown under the "Eliminate render-blocking resources" section. How do I fix it? What should I do?
Also, the "Coverage" tab shows some source code that is not being used. For example, I am not using the easy-share plugin (second row in the image) on the homepage.
How do I safely remove that code from the home page? If I can learn how one fix is done, I can do the others myself.
The issue you are running into is something I have seen over and over again. GoDaddy and WordPress sites are generally bloated and perform poorly.
Here are some tips to improve your speed and get a better PageSpeed score.
Hosting: Do you need to be on GoDaddy? I have seen this time and time again: most websites on GoDaddy are slow. GoDaddy is good for domain registration, not for hosting, and most non-technical folks do not know any better. Try Amazon Lightsail, AWS S3, Google Firebase, or Netlify. They all offer much faster page loads by reducing initial server response time, and they are surprisingly simple to learn and deploy.
CDN: You must use a content delivery network (CDN). Check out CloudFront; it offers a free tier that works quite well.
WordPress: This is your real issue. WordPress is neither easy to build with nor easy to maintain, and you need multiple plugins to make a site perform well. Ideally, build your own site. If you have to stay on WordPress, check out image optimizers, minifiers, and cache plugins. Gumlet, WP Rocket, and ShortPixel are popular choices for improving speed.
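To track whether changes like these actually move the score, you can query the PageSpeed Insights API from a script instead of re-running the web UI by hand. The sketch below assumes the v5 endpoint and the JSON field names I believe the API uses (lighthouseResult, categories.performance, the render-blocking-resources audit); check the official API documentation before relying on them.

```python
import requests

# PageSpeed Insights API v5 endpoint (add a key parameter for higher quotas).
API = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

def check_page(url: str, strategy: str = "mobile") -> None:
    """Print the performance score and any render-blocking resources."""
    params = {"url": url, "strategy": strategy}
    data = requests.get(API, params=params, timeout=60).json()

    lighthouse = data["lighthouseResult"]
    score = lighthouse["categories"]["performance"]["score"] * 100
    print(f"{strategy} performance score for {url}: {score:.0f}")

    # Audit id (as I understand it) for "Eliminate render-blocking resources".
    audit = lighthouse["audits"].get("render-blocking-resources", {})
    for item in audit.get("details", {}).get("items", []):
        print("render-blocking:", item.get("url"))

if __name__ == "__main__":
    check_page("https://example.com/")  # placeholder URL
```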

AJAX Crawling with question mark instead of hashbang

Where I'm at: I've read Google's documentation regarding its AJAX crawling, and I've searched around a bit on this website and others, but I'm quite confused, as it seems that all the questions address the same issue: AJAX crawling with hashbangs.
I've developed an app which, among other things, lets the user search for locations worldwide, using an AJAX searcher quite similar to Google's, but my app uses exclusively the question mark in its AJAX URLs instead of the hashbang. Due to compatibility issues, changing it to the hashbang is not an option.
Not only am I confused by the fact that I could not find anyone else using the question mark instead of the hashbang, I'm also wondering whether there is any documentation covering my issue: how to let Googlebot crawl all my AJAX content when I'm using the question mark instead of a hashbang in my AJAX app.
The AJAX crawling scheme was created explicitly for applications and websites using hashbangs (#!) in their URL structure, because the fragment part of a URL only exists on the client side; the URL rewriting in the spec, i.e. from #! to ?_escaped_fragment_=, is meant to solve that.
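As a concrete illustration of that rewriting step, here is a small sketch of the (now historical) mapping: a crawler supporting the scheme takes the fragment after #! and re-requests the page with it moved into the _escaped_fragment_ query parameter. This sketch percent-encodes the whole fragment for simplicity; the original specification only escapes a handful of reserved characters.

```python
from urllib.parse import quote

def escaped_fragment_url(url: str) -> str:
    """Rewrite a #! URL into its _escaped_fragment_ form, per the old AJAX crawling scheme."""
    if "#!" not in url:
        return url  # nothing to rewrite
    base, fragment = url.split("#!", 1)
    separator = "&" if "?" in base else "?"
    # Move the fragment into the query string so it reaches the server.
    return f"{base}{separator}_escaped_fragment_={quote(fragment, safe='')}"

print(escaped_fragment_url("https://example.com/search#!q=london"))
# -> https://example.com/search?_escaped_fragment_=q%3Dlondon
```

Since your app already puts its state in the query string, the URLs reach the server as-is and there is nothing to rewrite, which is why the scheme never mentions the question-mark case.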
Since most of the web already makes use of JavaScript in one way or another, we (Google) needed a better solution, so we started executing JavaScript on the pages we crawl and effectively render every page, just like a normal browser would. To quote our blog post, Understanding web pages better:
In order to solve this problem, we decided to try to understand pages by executing JavaScript. It’s hard to do that at the scale of the current web, but we decided that it’s worth it. We have been gradually improving how we do this for some time. In the past few months, our indexing system has been rendering a substantial number of web pages more like an average user’s browser with JavaScript turned on.
You can also see what we "see" using Fetch as Google in Search Console (formerly Webmaster Tools); read more about the feature in our post titled Rendering pages with Fetch as Google.
Before you do anything else, please try to fetch a few pages from your site with Fetch as Google. You might not have to do anything at all; it might actually work out of the box. And the good news is that it's not only Google that renders pages!

How to measure website performance manually by coding? Any ideas?

I have to develop an application that checks the performance of a website.
I need some guidance on how I can do this in code. I have searched a lot, but Google only turns up ready-made tools for it. Does anyone have any idea about this? Please suggest a direction I should work in. Help will be appreciated; waiting for your responses.
I have seen that many sites give results based on the pages of our websites, so they might be getting the pages from the sitemap; that is my guess, I don't know the rest. I think they might be fetching each page and logging the load time and response time, but I still don't know how to log those things. I need your guidance to proceed further. Thank you.
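Following that line of thinking, here is a minimal sketch of the approach described above: read URLs from a sitemap, fetch each one, and log how long the response took. It uses Python's standard library plus the requests package; the sitemap URL is a placeholder, and "load time" here means only server response plus transfer time, not full browser rendering time.

```python
import time
import xml.etree.ElementTree as ET

import requests

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_url: str) -> list[str]:
    """Download a sitemap and return the URLs it lists."""
    xml = requests.get(sitemap_url, timeout=30).text
    root = ET.fromstring(xml)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]

def time_request(url: str) -> tuple[float, float]:
    """Return (time to first byte, total download time) in seconds."""
    start = time.perf_counter()
    response = requests.get(url, timeout=60, stream=True)
    ttfb = time.perf_counter() - start    # headers received
    _ = response.content                  # force the body download
    total = time.perf_counter() - start
    return ttfb, total

if __name__ == "__main__":
    for url in urls_from_sitemap(SITEMAP_URL):
        ttfb, total = time_request(url)
        print(f"{url}: first byte {ttfb:.3f}s, total {total:.3f}s")
```

A real browser also downloads CSS, JavaScript, and images and then renders the page, so for full page-load timing you would drive a headless browser (for example via Selenium) instead of a bare HTTP client.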

Are we doing error codes wrong?

I have great luck using a combination of Google (and usually Stack Overflow) to find help with errors in software. But I'm wondering if there's a better way. How about tagging all errors with a unique ID?
This is just a suggestion; hopefully someone will take it in an even better direction. As a starting point, I see errors being registered the way we register websites. Maybe they are websites. Each error would have a URL, and that URL would have an associated abbreviated version for cases where we want to reference the error but want to save space.
The app developer would be under no obligation to provide anything at the error URL; that would be optional but nice. Maybe the URLs would all be based on a global domain, like Wikipedia, where anyone can contribute information. My main goal, though, is just to tag errors with something that makes web searches more effective when I'm looking for help.
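For a sense of what the proposal might look like in practice, here is a tiny hypothetical sketch: an exception type that carries a stable error ID and a lookup URL, so the string a user sees (and searches for) is globally unique. The registry domain and ID format are invented purely for illustration.

```python
class TaggedError(Exception):
    """An exception that carries a globally unique, searchable error ID."""

    # Hypothetical registry domain; no such registry actually exists.
    REGISTRY = "https://errs.example.org"

    def __init__(self, error_id: str, message: str):
        self.error_id = error_id                   # abbreviated form, e.g. "MYAPP-00042"
        self.url = f"{self.REGISTRY}/{error_id}"   # long, documentable form
        super().__init__(f"[{error_id}] {message} (see {self.url})")

# The unique tag ends up in logs, tracebacks, and bug reports, so searching
# the web for "MYAPP-00042" finds exactly this error.
try:
    raise TaggedError("MYAPP-00042", "configuration file is missing")
except TaggedError as exc:
    print(exc)
```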

Google not indexing URLs

Hello, I have a personal site, and about 1 month ago I rebuilt the complete site. I submitted a new sitemap.xml file and it is not indexed yet, but I am getting 404 crawl errors for the old URLs.
Google says the sitemap is correct, so, any ideas? Must I do something, or just wait longer?
It is not really important because it is just a personal site, but I am curious about why this is happening.
Sorry for my bad English (I'm Spanish), and thanks in advance.
This just takes time.
I experienced some speed-up in that process by using other Google services such as Places or Analytics. But to answer your question:
If your sitemap has been detected correctly it will work, but it might take some time.
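While you wait, one thing you can verify yourself is that the new sitemap is reachable and that every URL it lists responds with 200, so there is nothing on your side holding indexing back. A minimal sketch, assuming the sitemap lives at the usual /sitemap.xml path; the domain is a placeholder.

```python
import xml.etree.ElementTree as ET

import requests

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder, use your own domain
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

sitemap = requests.get(SITEMAP_URL, timeout=30)
print("sitemap status:", sitemap.status_code)

# Every URL listed in the new sitemap should return 200 so it can be indexed.
for loc in ET.fromstring(sitemap.text).iter(f"{NS}loc"):
    status = requests.head(loc.text, timeout=30, allow_redirects=True).status_code
    print(status, loc.text)
```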
