I use Curl inside a Ruby script to scrape this kind of page:
https://www.example.com/page.html?a=1&b=2&c=3&d=4
Using this code:
value = `curl #{ARGV[0]} | grep "findMe:"`
result = value.scan(/findMe: (.*)/).flatten.first.split('$').last.gsub(',', '').to_f
puts result
But since that page uses a 301 redirect, it was failing, so I added the -L option:
value = `curl -L #{ARGV[0]} | grep "findMe:"`
result = value.scan(/findMe: (.*)/).flatten.first.split('$').last.gsub(',', '').to_f
puts result
It now returns the result fine, but the script doesn't end. In the SSH terminal it seems to prompt me about the other parameters after displaying the result, so I must press Enter to abort the script.
When I implement this in a web app, the script doesn't work, probably due to this parameter-related problem.
So, how can I tell curl to exit right after displaying the result, or, better, to ignore any parameters in the given URL in the first place (since they don't affect the page content)? Then it would just scrape this part:
https://www.example.com/page.html
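If the query string really is irrelevant, one option (a minimal sketch, not the only approach) is to strip it in Ruby before shelling out, and to escape the URL so the shell does not treat the & characters as command separators:
require 'uri'
require 'shellwords'

# Drop the query string and escape the URL before handing it to the shell,
# so "&" is not interpreted as a background-job separator.
uri = URI(ARGV[0])
uri.query = nil

value = `curl -L #{uri.to_s.shellescape} | grep "findMe:"`
result = value.scan(/findMe: (.*)/).flatten.first.split('$').last.gsub(',', '').to_f
puts result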
I'm trying to download some files from a URL. I made a tiny script in Ruby to get these files. Here is the script:
require 'nokogiri'
require 'open-uri'
(1..2).each do |season|
  (1..3).each do |ep|
    season = season.to_s.rjust(2, '0')
    ep = ep.to_s.rjust(2, '0')
    page = Nokogiri::HTML(open("https://some-url/s#{season}e#{ep}/releases"))
    page.css('table.table tbody tr td a').each do |el|
      link = el['href']
      `curl "https://some-url#{link}"` if link.match('sujaidr.srt$')
    end
  end
end
puts "done"
But the response from curl is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
<p>You should be redirected automatically to target URL:
/some-url/s0Xe0Y/releases. If not click the link.
When I use wget, the redirected page is downloaded. I tried setting the user agent, but that didn't work. The server only serves the redirect when I try to download the files through curl or other CLI tools like wget, aria2c, httpie, etc. I can't find a solution so far.
How can I do this?
Solved
I decided to use Watir WebDriver to do this. It works great for now.
If you want to download the file rather than the page doing the redirection, try using the -L option in your code, for example:
curl -L "https://some-url#{link}"
From the curl man page:
-L, --location
    (HTTP) If the server reports that the requested page has moved to a
    different location (indicated with a Location: header and a 3XX response
    code), this option will make curl redo the request on the new place.
If you are using Ruby, instead of calling curl or other third-party tools, you may want to use something like this:
require 'net/http'
# Must be somedomain.net instead of somedomain.net/, otherwise, it will throw exception.
Net::HTTP.start("somedomain.net") do |http|
  resp = http.get("/flv/sample/sample.flv")
  open("sample.flv", "wb") do |file|
    file.write(resp.body)
  end
end
puts "Done."
Check this answer from where the example came out: https://stackoverflow.com/a/2263547/1135424
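One caveat with that Net::HTTP example: a plain get via Net::HTTP does not follow redirects on its own, which was the original problem here. A minimal sketch of following them manually (the redirect limit and filename are arbitrary):
require 'net/http'
require 'uri'

# Follow up to 5 redirects, resolving relative Location headers against
# the current URL, then return the final response body.
def fetch(url, limit = 5)
  raise 'Too many redirects' if limit.zero?

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPRedirection
    fetch(URI.join(url, response['location']).to_s, limit - 1)
  else
    response.body
  end
end

File.binwrite('sample.flv', fetch('http://somedomain.net/flv/sample/sample.flv'))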
I want to write some code I can run in bash that takes a list of URLs and checks whether each one returns a 404. If a site does not return a 404, its URL should be written to the output list.
So in the end I should have a list of working sites.
I don't know how to write this code.
This looks like something that could work, right?:
How to check if a URL exists or returns 404 with Java?
You can use this code and build on it as necessary:
#!/bin/bash
array=( "http://www.stackoverflow.com" "http://www.google.com" )
for url in "${array[@]}"
do
  if ! curl -s --head --request GET "${url}" | grep "404 Not Found" > /dev/null
  then
    echo "Output URL not returning 404 ${url}"
  fi
done
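If you'd rather do the same check from Ruby (a sketch; it compares the response class instead of grepping the status line, and writes the surviving URLs to a file):
require 'net/http'
require 'uri'

urls = ['http://www.stackoverflow.com', 'http://www.google.com']

# Keep only the URLs whose response is not a 404.
working = urls.reject { |url| Net::HTTP.get_response(URI(url)).is_a?(Net::HTTPNotFound) }

File.write('working_urls.txt', working.join("\n") + "\n")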
Thanks for your help. I found a package for Linux called linkchecker. It does exactly what I want.
I have fingreprint.txt at the location "#{node['abc.d']}/fingreprint.txt"
The contents of the file are as below:
time="2015-03-25T17:53:12C" level=info msg="SHA1 Fingerprint=7F:D0:19:C5:80:42:66"
Now I want to retrieve the value of the fingerprint and assign it to a Chef attribute.
I am using the following ruby_block:
ruby_block "retrieve_fingerprint" do
block do
path="#{node['abc.d']}/fingreprint.txt"
Chef::Resource::RubyBlock.send(:include, Chef::Mixin::ShellOut)
command = 'grep -Po '(?<=Fingerprint=)[^"]*' path '
command_out = shell_out(command)
node.default['fingerprint'] = command_out.stdout
end
action :create
end
It seems not to be working because of the missing escape characters in command = 'grep -Po '(?<=Fingerprint=)[^"]*' path '.
Please let me know if there is some other way of assigning file content to a node attribute.
Two ways to answer this: first, I would do the read (IO.read) and the parsing (Regexp and friends) in Ruby rather than shelling out to grep.
if IO.read("#{node['abc.d']}/fingreprint.txt") =~ /Fingerprint=([^"]+)/
  node.default['fingerprint'] = $1
end
Second, don't do this at all, because it probably won't behave how you expect. You would have to take into account both the two-pass loading process and the fact that default attributes are reset on every run. If you're trying to make an Ohai plugin, do that instead. If you're trying to use this data in later resources, you'll probably want to store it in a global variable and make copious use of the lazy {} helper, as in the sketch below.
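For the later-resources case, a minimal sketch of that pattern might look like this (the file resource and its path are only illustrative; any variable visible to both blocks works):
# Sketch: read the value at converge time and hand it to a later resource lazily.
fingerprint = nil # shared between the two resources below

ruby_block 'retrieve_fingerprint' do
  block do
    content = IO.read("#{node['abc.d']}/fingreprint.txt")
    fingerprint = content[/Fingerprint=([^"]+)/, 1]
  end
end

# Illustrative consumer; lazy {} delays evaluation until converge time,
# after the ruby_block above has run.
file '/tmp/fingerprint' do
  content lazy { fingerprint }
end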
I am using the -O or --output-document option on wget to get the HTML from a website. However, the -O option requires a file to store the output in, and I would like to store it in a variable in my program so that I can manipulate it more easily. Is there any way to do this without rereading it from the file? In essence, I am manually creating a crude cache.
Sample code
#!/usr/bin/ruby
url= "http://www.google.com"
whereIWantItStored = `wget #{url} --output-document=outsideFile`
Reference:
I found this post helpful in using wget within my program: Using wget via Ruby on Rails
#!/usr/bin/ruby
url= "http://www.google.com"
whereIWantItStored = `wget #{url} -O -`
Be sure to sanitize your URL to avoid shell injection. The - after -O means standard output, which gets captured by the Ruby backticks.
https://www.owasp.org/index.php/Command_Injection explains shell injection.
http://apidock.com/ruby/Shellwords/shellescape for Ruby >= 1.9, or the Escape gem for Ruby 1.8.x.
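For example, escaping with Shellwords (part of Ruby's standard library) might look like this (a sketch; -q just silences wget's progress output):
#!/usr/bin/ruby
require 'shellwords'

url = "http://www.google.com"
# shellescape neutralises characters like &, ; and spaces so the URL
# cannot inject extra shell commands.
whereIWantItStored = `wget -q #{url.shellescape} -O -`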
I wouldn't use wget. I'd use something like HTTParty.
Then you could do:
require 'httparty'
url = 'http://www.google.com'
response = HTTParty.get(url)
whereIWantItStored = response.code == 200 ? response.body : nil
I made a very small app for the Raspberry Pi that uses Sinatra:
https://github.com/khebbie/SpeakPi
The app lets the user input some text in a textarea and asks Google to create an mp3 file for it.
In the app I have a shell script called speech2.sh, which calls Google and plays the mp3 file:
#!/bin/bash
say() {
  wget -q -U Mozilla -O out.mp3 "http://translate.google.com/translate_tts?tl=da&q=$*";
  local IFS=+; omxplayer out.mp3;
}
say $*
When I call speech2.sh from the command line like so:
./speech2.sh %C3%A6sel
It pronounces %C3%A6 like the Danish letter 'æ', which is correct!
I call speech2.sh from a Sinatra route like so:
post '/say' do
  message = params[:body]
  system('/home/pi/speech2.sh ' + message)
  haml :index
end
And when I do so, Google pronounces some very weird characters, like 'a broken pipe...', which is wrong!
All characters a-z are pronounced correctly.
I have tried some URL encoding and decoding; nothing worked.
I tried outputting the message to the command line and it was exactly "%C3%A6", which certainly did not make sense.
Do you have any idea what I am doing wrong?
EDIT
To sum it up and simplify: if I type this in bash:
./speech2.sh %C3%A6sel
It works
If I start an irb session and type:
system('/home/pi/speech2.sh', '%C3%A6sel')
It does not work!
Since it is handling UTF-8, make sure that the encoding stays right all the way through the process: add the # encoding: UTF-8 magic comment at the top of the Ruby script, and pass the ie=UTF-8 parameter in the query string when calling Google Translate.
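For example, the changes might look roughly like this (a sketch based on the code in the question; only the magic comment is new on the Ruby side, and the ie=UTF-8 parameter would go in the wget URL inside speech2.sh, e.g. "...translate_tts?ie=UTF-8&tl=da&q=$*"):
# encoding: UTF-8
require 'sinatra'
require 'haml'

post '/say' do
  message = params[:body]
  system('/home/pi/speech2.sh ' + message)
  haml :index
end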