I would like to convert the data scraping from an internet site, regarding the time, the data is extracted like this (for example 9:15) and inserted into the cell, I would like at the bottom of the column to make the total of the hours, the problem I would like python to convert it to numerical format so that I can add it up.
any idea?
def excel():
# Writing on a EXCEL FILE
filename = f"Monatsplan {userfinder} {month} {year}.xlsx"
try:
wb = load_workbook(filename)
ws = wb.worksheets[0] # select first worksheet
except FileNotFoundError:
headers_row = [
"Datum",
"Tour",
"Funktion",
"Von",
"Bis",
"Schichtdauer",
"Bezahlte Zeit",
]
wb = Workbook()
ws = wb.active
ws.append(headers_row)
wb.save(filename)
ws.append(
[
datumcleaned[:10],
tagesinfo,
"",
"",
"",
"",
"",
]
)
wb.save(filename)
wb.close()
excel()
You should split the data you scrapped.
time_scrapped = '9:15'
time_split = time_scrapped.split(":")
hours = int(time_split[0])
minutes = int(time_split[1])
Then you can place it in separate columns and create formula at the bottom of the column.
I'm learning Scrapy. As an exercise I want to get the product title in this web page https://scrapingclub.com/exercise/detail_json/ using this code:
scrapy shell "https://scrapingclub.com/exercise/detail_json/"
response.xpath("//h3[1]/text()")
[]
but the only thing I get is nothing (a zero dim dic).
Try this,
response.xpath("//script[contains(., 'title')]/text()")
if you press control+u you can see this information at the footer
var obj = {
"title": "Short Sweatshirt",
"price": "$24.99",
"description": "Short sweatshirt with long sleeves and ribbing at neckline, cuffs, and hem. 57% cotton, 43% polyester. Machine wash
cold.",
"img_path": "/static/img/" + "96230-C" + ".jpg" };
https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query
Google Analytics tells me ~20k pages were visited from Google search, but the Google Search Console API returns just under 5k urls. Changing startRow doesn't help.
What's really odd is I've connected Google Search Console to Google Analytics, and when viewing GSC data in GA (Acquisition -> Search Console -> Landing Pages) the GSC data there also gives me ~20k rows.
How do I get all ~20k rows out of the Google Search Console API?
date_str = '2017-12-20'
start_index = 0
row_limit = 5000
next_index = 5000
rows = []
while next_index == row_limit:
req = webmasters_service.searchanalytics().query(
siteUrl='https://tenor.com/',
fields='responseAggregationType,rows',
body={
"startDate": date_str,
"endDate": date_str,
"searchType": search_type,
"dimensions": [
"page",
],
"rowLimit": row_limit,
"startRow": start_index,
},
)
try:
resp = req.execute()
next_rows = resp['rows']
rows += next_rows
next_index = len(next_rows)
start_index += next_index
except Exception as e:
print(e)
break
return rows
For anyone else viewing this post:
When I view 'Search Results' under 'Performance' in the sidebar of the google search console webpage, at the end of the url there is a variable 'resource_id=sc-domain%3example.com'. I recommend using this resource_id variable as your siteURL.
I am using this page:
https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.2942j0j7&sourceid=chrome&ie=UTF-8
I am trying to get the this element: class="_XWk"
page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
parse_page = Nokogiri::HTML(page)
parse_page.css('_XWk')
Here I can see the whole page in parse_page but when I try the .cc('classname') I don't see anything. Am I using the method the wrong way?
Check out the SelectorGadget Chrome extension to grab css selectors by clicking on the desired element in the browser.
It's because of a simple typo, e.g. . (dot) before selector as ran already mentioned.
In addition, the next problem might occur because no HTTP user-agent is specified thus Google will block a request eventually and you'll receive a completely different HTML that will contain an error message or something similar without the actual data you was looking for. What is my user-agent.
Pass a user-agent:
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
HTTParty.get("https://www.google.com/search", headers: headers)
Iterate over container to extract titles from Google Search:
data = doc.css(".tF2Cxc").map do |result|
title = result.at_css(".DKV0Md")&.text
Code and example in the online IDE:
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
q: "ford fusion msrp",
num: "20"
}
response = HTTParty.get("https://www.google.com/search",
query: params,
headers: headers)
doc = Nokogiri::HTML(response.body)
data = doc.css(".tF2Cxc").map do |result|
title = result.at_css(".DKV0Md")&.text
link = result.at_css(".yuRUbf a")&.attr("href")
displayed_link = result.at_css(".tjvcx")&.text
snippet = result.at_css(".VwiC3b")&.text
puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford
Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''
Alternatively, you can achieve this by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out what the correct selector is or why results are different in the output since it's already done for the end-user.
Basically, the only thing that needs to be done is just to iterate over structured JSON and get the data you were looking for.
Example code:
require 'google_search_results'
params = {
api_key: ENV["API_KEY"],
engine: "google",
q: "ford fusion msrp",
hl: "en",
num: "20"
}
search = GoogleSearch.new(params)
hash_results = search.get_hash
data = hash_results[:organic_results].map do |result|
title = result[:title]
link = result[:link]
displayed_link = result[:displayed_link]
snippet = result[:snippet]
puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford
Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''
P.S - I wrote a blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.
It looks like something is swapping the classes so what you see in the browser is not what you are getting from the http call. In this case from _XWk to _tA
page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
parse_page = Nokogiri::HTML(page)
parse_page.css('._tA').map(&:text)
# >>["Up to 23 city / 34 highway", "From $22,610", "175 to 325 hp", "192″ L x 73″ W x 58″ H", "3,431 to 3,681 lbs"]
Change parse_page.css('_XWk') to parse_page.css('._XWk')
Note the dot (.) difference. The dot references a class.
Using parse_page.css('_XWk'), nokogiri doesn't know wether _XWk is a class, id, data attribute etc..
I'm a Java guy, new to Ruby. I've been playing with it just to see what it can do, and I'm running into an issue that I can't solve.
I decided to try out Sinatra, again, just to see what it can do, and decided to play with the ESPN API and see if I can pull the venue of a team via the API.
I'm able to make the call and get the data back, but I am having trouble parsing it:
{"sports"=>[{"name"=>"baseball", "id"=>1, "uid"=>"s:1", "leagues"=>[{"name"=>"Major League Baseball", "abbreviation"=>"mlb", "id"=>10, "uid"=>"s:1~l:10", "groupId"=>9, "shortName"=>"MLB", "teams"=>[{"id"=>17, "uid"=>"s:1~l:10~t:17", "location"=>"Cincinnati", "name"=>"Reds", "abbreviation"=>"CIN", "color"=>"D60042", "venues"=>[{"id"=>83, "name"=>"Great American Ball Park", "city"=>"Cincinnati", "state"=>"Ohio", "country"=>"", "capacity"=>0}], "links"=>{"api"=>{"teams"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17"}, "news"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news"}, "notes"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news/notes"}}, "web"=>{"teams"=>{"href"=>"http://espn.go.com/mlb/team/_/name/cin/cincinnati-reds?ex_cid=espnapi_public"}}, "mobile"=>{"teams"=>{"href"=>"http://m.espn.go.com/mlb/clubhouse?teamId=17&ex_cid=espnapi_public"}}}}]}]}], "resultsOffset"=>0, "resultsLimit"=>50, "resultsCount"=>1, "timestamp"=>"2013-08-04T14:47:13Z", "status"=>"success"}
I want to pull the venues part of the object, specifically the name value. Every time I try to parse it I end up getting an error along the lines of "cannot change from nil to string" and then also I've gotten an integer to string error.
Here's what i have so far:
get '/venue/:team' do
id = ids[params[:team]]
url = 'http://api.espn.com/v1/sports/baseball/mlb/teams/' + id + '?enable=venues&apikey=' + $key
resp = Net::HTTP.get_response(URI.parse(url))
data = resp.body
parsed = JSON.parse(resp.body)
#venueData = parsed["sports"]
"Looking for the venue of the #{params[:team]}, which has id " + id + ", and here's the data returned: " + venueData.to_s
end
When I do parsed["sports"} I get:
[{"name"=>"baseball", "id"=>1, "uid"=>"s:1", "leagues"=>[{"name"=>"Major League Baseball", "abbreviation"=>"mlb", "id"=>10, "uid"=>"s:1~l:10", "groupId"=>9, "shortName"=>"MLB", "teams"=>[{"id"=>17, "uid"=>"s:1~l:10~t:17", "location"=>"Cincinnati", "name"=>"Reds", "abbreviation"=>"CIN", "color"=>"D60042", "venues"=>[{"id"=>83, "name"=>"Great American Ball Park", "city"=>"Cincinnati", "state"=>"Ohio", "country"=>"", "capacity"=>0}], "links"=>{"api"=>{"teams"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17"}, "news"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news"}, "notes"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news/notes"}}, "web"=>{"teams"=>{"href"=>"http://espn.go.com/mlb/team/_/name/cin/cincinnati-reds?ex_cid=espnapi_public"}}, "mobile"=>{"teams"=>{"href"=>"http://m.espn.go.com/mlb/clubhouse?teamId=17&ex_cid=espnapi_public"}}}}]}]}]
But nothing else parses. Please help!
Like I said, I'm not trying to do anything fancy, just figure out Ruby a little for fun, but I have been stuck on this issue for days now. Any help would be appreciated!
EDIT:
JSON straight from the API:
{"sports" :[{"name" :"baseball","id" :1,"uid" :"s:1","leagues" :[{"name" :"Major League Baseball","abbreviation" :"mlb","id" :10,"uid" :"s:1~l:10","groupId" :9,"shortName" :"MLB","teams" :[{"id" :17,"uid" :"s:1~l:10~t:17","location" :"Cincinnati","name" :"Reds","abbreviation" :"CIN","color" :"D60042","venues" :[{"id" :83,"name" :"Great American Ball Park","city" :"Cincinnati","state" :"Ohio","country" :"","capacity" :0}],"links" :{"api" :{"teams" :{"href" :"http://api.espn.com/v1/sports/baseball/mlb/teams/17"},"news" :{"href" :"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news"},"notes" :{"href" :"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news/notes"}},"web" :{"teams" :{"href" :"http://espn.go.com/mlb/team/_/name/cin/cincinnati-reds?ex_cid=espnapi_public"}},"mobile" :{"teams" :{"href" :"http://m.espn.go.com/mlb/clubhouse?teamId=17&ex_cid=espnapi_public"}}}}]}]}],"resultsOffset" :0,"resultsLimit" :50,"resultsCount" :1,"timestamp" :"2013-08-05T19:44:32Z","status" :"success"}
The result of data.inspect:
"{\"sports\" :[{\"name\" :\"baseball\",\"id\" :1,\"uid\" :\"s:1\",\"leagues\" :[{\"name\" :\"Major League Baseball\",\"abbreviation\" :\"mlb\",\"id\" :10,\"uid\" :\"s:1~l:10\",\"groupId\" :9,\"shortName\" :\"MLB\",\"teams\" :[{\"id\" :17,\"uid\" :\"s:1~l:10~t:17\",\"location\" :\"Cincinnati\",\"name\" :\"Reds\",\"abbreviation\" :\"CIN\",\"color\" :\"D60042\",\"venues\" :[{\"id\" :83,\"name\" :\"Great American Ball Park\",\"city\" :\"Cincinnati\",\"state\" :\"Ohio\",\"country\" :\"\",\"capacity\" :0}],\"links\" :{\"api\" :{\"teams\" :{\"href\" :\"http://api.espn.com/v1/sports/baseball/mlb/teams/17\"},\"news\" :{\"href\" :\"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news\"},\"notes\" :{\"href\" :\"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news/notes\"}},\"web\" :{\"teams\" :{\"href\" :\"http://espn.go.com/mlb/team/_/name/cin/cincinnati-reds?ex_cid=espnapi_public\"}},\"mobile\" :{\"teams\" :{\"href\" :\"http://m.espn.go.com/mlb/clubhouse?teamId=17&ex_cid=espnapi_public\"}}}}]}]}],\"resultsOffset\" :0,\"resultsLimit\" :50,\"resultsCount\" :1,\"timestamp\" :\"2013-08-05T19:44:24Z\",\"status\" :\"success\"}"
parsed["sports"] does not exist, parse your input and inspect it/ dump it
With the data you've provided in the question, you can get to the venues information like this:
require 'json'
json = JSON.parse data
json["sports"].first["leagues"].first["teams"].first["venues"]
# => [{"id"=>83, "name"=>"Great American Ball Park", "city"=>"Cincinnati", "state"=>"Ohio", "country"=>"", "capacity"=>0}]
By replacing each of the first calls with an iterator, you can search through without knowing where the data is:
json["sports"].each{|h|
h["leagues"].each{|h|
h["teams"].each{|h|
venues = h["venues"].map{|h| h["name"]}.join(", ")
puts %Q!name: #{h["location"]} #{h["name"]} venues: #{venues}!
}
}
}
This outputs:
name: Cincinnati Reds venues: Great American Ball Park
Depending on how stable the response data is you may be able to cut out several of the iterators:
json["sports"].first["leagues"]
.first["teams"]
.each{|h|
venues = h["venues"].map{|h| h["name"] }.join(", ")
puts %Q!name: #{h["location"]} #{h["name"]} venues: #{venues}!
}
and you'll most likely want to save the data, so something like each_with_object is helpful:
team_and_venues = json["sports"].first["leagues"]
.first["teams"]
.each_with_object([]){|h,xs|
venues = h["venues"].map{|h| h["name"]}.join(", ")
xs << %Q!name: #{h["location"]} #{h["name"]} venues: #{venues}!
}
# => ["name: Cincinnati Reds venues: Great American Ball Park"]
team_and_venues
# => ["name: Cincinnati Reds venues: Great American Ball Park"]
Notice that when an iterator declares variables, even if there is a variable with the same name outside the block, the scope of the block is respected and the block's variables remain local.
That's some pretty ugly code if you ask me, but it's a place to start.