Set proper header for crawler to prevent cached html - open-uri

Hello everyone, I am building a small web crawler that fetches news from some websites.
I am using Typhoeus.
My code looks like this:
request = Typhoeus::Request.new(url, timeout: 60)
request.on_complete do |response|
  doc = Nokogiri::HTML(response.body)
  root_url = source.website.url
  links = doc.css(css_selectors).take(20)
end
hydra.queue(request)
hydra.run
The problem is that some requests return a cached, old version of the page. I tried setting the headers and included "Cache-Control" => 'no-cache', but that didn't help!
Any help will be appreciated.
The same thing happens when using open-uri.
One of the websites' response headers:
{"Server"=>"nginx/1.10.2", "Date"=>"Sat, 07 Jan 2017 12:43:54 GMT", "Content-Type"=>"text/html; charset=utf-8", "Transfer-Encoding"=>"chunked", "Connection"=>"keep-alive", "X-Drupal-Cache"=>"MISS", "X-Content-Type-Options"=>"nosniff", "Etag"=>"\"1483786108-1\"", "Content-Language"=>"ar", "Link"=>"</taxonomy/term/1>; rel=\"shortlink\",</Actualit%C3%A9s>; rel=\"canonical\"", "X-Generator"=>"Drupal 7 (http://drupal.org)", "Cache-Control"=>"public, max-age=0", "Expires"=>"Sun, 19 Nov 1978 05:00:00 GMT", "Vary"=>"Cookie,Accept-Encoding", "Last-Modified"=>"Sat, 07 Jan 2017 10:48:28 GMT", "X-Cacheable"=>"YES", "X-Served-From-Cache"=>"Yes"}

This should work:
"Cache-Control" => 'no-cache, no-store, must-revalidate'
"Pragma" => 'no-cache'
"Expires" => '0'

Related

How to use a Typhoeus::Request object over HTTPS

I'm trying to make an HTTPS request using the Typhoeus::Request object, and I can't get it working.
The code I'm running is something like this:
url = "https://some.server.com/"
req_opts = {
  :method => :get,
  :headers => {
    "Content-Type" => "application/json",
    "Accept" => "application/json"
  },
  :params => {},
  :params_encoding => nil,
  :timeout => 0,
  :ssl_verifypeer => true,
  :ssl_verifyhost => 2,
  :sslcert => nil,
  :sslkey => nil,
  :verbose => true
}
request = Typhoeus::Request.new(url, req_opts)
response = request.run
The response I'm getting is this:
HTTP/1.1 302 Found
Location: https://some.server.com:443/
Date: Sat, 27 Apr 2019 02:25:05 GMT
Content-Length: 5
Content-Type: text/plain; charset=utf-8
Why is this happening?
It's hard to say for sure, because your example is not a reachable URL. But two things stand out: you are not passing an SSL cert or key, and the 302 indicates a redirect. You can try following the redirect, but your first problem is probably that you don't need to set the SSL options at all; why are you?
See if the following options work:
req_opts = {
  :method => :get,
  :headers => {
    "Content-Type" => "application/json",
    "Accept" => "application/json"
  },
  :params => {},
  :params_encoding => nil,
  :timeout => 0,
  :followlocation => true,
  :ssl_verifypeer => false,
  :ssl_verifyhost => 0,
  :verbose => true
}
See the following sections for more info
https://github.com/typhoeus/typhoeus#following-redirections
https://github.com/typhoeus/typhoeus#ssl
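
A short usage sketch with these options, reusing the request from the question; Typhoeus exposes the final URL after redirection, which makes it easy to confirm the 302 was followed:

request = Typhoeus::Request.new(url, req_opts)
response = request.run

puts response.code           # should be 200 once the redirect is followed
puts response.effective_url  # final URL after following the Location header
puts response.body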

How to call an XML request in Ruby using HTTParty?

While calling an XML API in Ruby, I get an XML parsing error as the response.
API call:
require 'httparty'
response = HTTParty.post("http://www.99acres.com/99api/v1/getmy99Response/test/uid/",
  :headers => { "Accept" => "application/xml", "Content-Type" => "application/xml" },
  :body => '<?xml version="1.0"?><query>
<user_name>test</user_name><pswd>testest</pswd><start_date>2019-03-25 12:03:00</start_date><end_date>2019-04-24 12:04:00</end_date></query>'
)
Error response:
ERROR-0000 XML Parsing Error
How do I call an XML request API in Ruby?
Your request to the API is correct, but it looks like they just aren't giving you access: your login credentials appear to be invalid. There is no problem with your code!
This is the response from the API:
#<HTTParty::Response:0x7ff11da769b0 parsed_response={"response"=>{"code"=>"1", "msg"=>"INVALID KEY"}}, #response=#<Net::HTTPUnauthorized 401 Unauthorized readbody=true>, #headers={"server"=>["nginx"], "content-type"=>["text/xml;charset=UTF-8"], "content-length"=>["85"], "content-security-policy-report-only"=>["block-all-mixed-content; report-uri https://track.99acres.com/csp_logging.php;"], "etag"=>["\"d41d8cd98f00b204e9800998ecf8427e\""], "date"=>["Wed, 24 Apr 2019 18:46:46 GMT"], "connection"=>["close"], "set-cookie"=>["99_ab=20; expires=Thu, 23-Apr-2020 18:46:46 GMT; Max-Age=31536000; path=/"]}>
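
As a quick check on your side, you can inspect the status code and parsed body on the HTTParty response object (a small sketch, reusing the response from the call above):

puts response.code                      # => 401 with the test credentials
puts response.parsed_response           # => {"response"=>{"code"=>"1", "msg"=>"INVALID KEY"}}
puts response.headers["content-type"]   # => "text/xml;charset=UTF-8"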

YouTube Live API - insert broadcast returns an insufficient scope error for some time

After creating a few successful YouTube live broadcasts, I started getting a 403 error with the following in the response.
{ status code: 403, headers {
"Cache-Control" = "private, max-age=0";
"Content-Encoding" = gzip;
"Content-Length" = 225;
"Content-Type" = "application/json; charset=UTF-8";
Date = "Mon, 06 Mar 2017 11:46:15 GMT";
Expires = "Mon, 06 Mar 2017 11:46:15 GMT";
Server = GSE;
Vary = "Origin, X-Origin";
"Www-Authenticate" = "Bearer realm=\"https://accounts.google.com/\", error=insufficient_scope, scope=\"https://www.googleapis.com/auth/youtube\"";
"alt-svc" = "quic=\":443\"; ma=2592000; v=\"36,35,34\"";
"x-content-type-options" = nosniff;
"x-frame-options" = SAMEORIGIN;
"x-xss-protection" = "1; mode=block";
}
My app was granted the ["/auth/youtube", "/auth/youtube.upload", "/auth/youtube.force-ssl", "/auth/youtube.readonly"] permissions by the user.
I have verified this with the oauth2/tokeninfo endpoint in the Google API explorer. My access token is valid and has all the required permissions.
This happened to me two days ago, but it started working again after a couple of hours. Now it is happening again.
The same API works at the same time for a different account. I logged out of Google and got a new access token, but the issue still exists.
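
For anyone debugging the same thing, here is a minimal sketch of checking which scopes a token was actually granted via Google's tokeninfo endpoint (access_token is a placeholder for the token under test):

require 'net/http'
require 'json'
require 'uri'

uri = URI("https://www.googleapis.com/oauth2/v3/tokeninfo")
uri.query = URI.encode_www_form(access_token: access_token)  # access_token is a placeholder
info = JSON.parse(Net::HTTP.get(uri))
puts info["scope"]       # should list https://www.googleapis.com/auth/youtube
puts info["expires_in"]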

Parse.com Analytics not showing

I'm trying to connect to Parse.com's REST API via NSURLConnection to track AppOpened metadata.
I get 200 OK back from the API, and the headers are the same as the cURL headers, but my API calls are not represented in the data browser on Parse.com. Is NSURLConnection doing something silly I don't know of? The API response is the same, but one request gets represented while the other one isn't.
NSLog output:
<NSHTTPURLResponse: 0x7ff5eb331ca0> { URL: https://api.parse.com/1/events/AppOpened } { status code: 200, headers {
"Access-Control-Allow-Methods" = "*";
"Access-Control-Allow-Origin" = "*";
Connection = "keep-alive";
"Content-Length" = 3;
"Content-Type" = "application/json; charset=utf-8";
Date = "Sun, 04 Jan 2015 22:42:54 GMT";
Server = "nginx/1.6.0";
"X-Parse-Platform" = G1;
"X-Runtime" = "0.019842";
} }
cURL output:
HTTP/1.1 200 OK
Access-Control-Allow-Methods: *
Access-Control-Allow-Origin: *
Content-Type: application/json; charset=utf-8
Date: Sun, 04 Jan 2015 23:03:51 GMT
Server: nginx/1.6.0
X-Parse-Platform: G1
X-Runtime: 0.012325
Content-Length: 3
Connection: keep-alive
{}
It's the same output. What am I doing wrong? Does anyone have experience with this?
It turns out Parse was showing incorrect API keys the moment I copied them out of the cURL example they provide in their docs. I don't know whose analytics I skewed, but I'm terribly sorry, and it wasn't my fault!
Always copy your API keys from [Your-Parse-App-Name] -> Settings -> Keys.
It was probably just a glitch on the server.
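
Although the question uses NSURLConnection, a hedged Ruby sketch of the same AppOpened call shows where the keys from Settings -> Keys go (app_id and rest_key are placeholders; the header names are the ones Parse's REST API documented at the time):

require 'net/http'
require 'json'
require 'uri'

uri = URI("https://api.parse.com/1/events/AppOpened")
req = Net::HTTP::Post.new(uri)
req["X-Parse-Application-Id"] = app_id    # placeholder: Application ID from Settings -> Keys
req["X-Parse-REST-API-Key"]   = rest_key  # placeholder: REST API key from Settings -> Keys
req["Content-Type"]           = "application/json"
req.body = {}.to_json

res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
puts res.code  # expect 200, as in the NSLog and cURL output above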

Mechanize and NTLM Authentication

The following code generates a 401 => Net::HTTPUnauthorized error.
From the log:
response-header: x-powered-by => ASP.NET
response-header: content-type => text/html
response-header: www-authenticate => Negotiate, NTLM
response-header: date => Mon, 02 Aug 2010 19:48:17 GMT
response-header: server => Microsoft-IIS/6.0
response-header: content-length => 1539
status: 401
The script is as follows:
require 'rubygems'
require 'mechanize'
require 'logger'
agent = WWW::Mechanize.new { |a| a.log = Logger.new("mech.log") }
agent.user_agent_alias = 'Windows IE 7'
agent.basic_auth("username","password")
page = agent.get("http://server/loginPage.asp")
I believe the reason for the 401 is that I need to be authenticating using NTLM, but I have been unable to find a good example of how to do this.
agent.add_auth('http://server', 'username', 'password', nil, 'domain.name')
http://mechanize.rubyforge.org/Mechanize.html
Tested with:
Windows Server 2012 R2 + IIS 8.5
Ruby 1.9.3
Mechanize 2 supports NTLM auth:
m = Mechanize.new
m.agent.username = 'user'
m.agent.password = 'password'
m.agent.domain = 'addomain'
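
A minimal end-to-end sketch using the add_auth form from the first answer (server, credentials, and domain are placeholders from the question):

require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Windows IE 7'
# realm is nil; the Active Directory domain is the last argument
agent.add_auth('http://server', 'username', 'password', nil, 'domain.name')
page = agent.get('http://server/loginPage.asp')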
