I am trying to get the transcription info from some Khan Academy videos using scrapy.
For example: https://www.khanacademy.org/math/algebra-basics/basic-alg-foundations/alg-basics-negative-numbers/v/opposite-of-a-number
When I tried to select the Transcript tab with the XPath response.xpath('//div[contains(@role, "tablist")]/a').extract(), I only got information about the tab that has aria-selected="true", which is the About section. I would need scrapy to change aria-selected from false to true on the Transcript button and then retrieve the necessary information.
Could anyone please clarify how I would be able to accomplish this?
Much appreciated!
If you take a look at the network inspector you can see that an AJAX request is made to retrieve the transcript once the page loads:
In this case it's https://www.khanacademy.org/api/internal/videos/2Zk6u7Uk5ow/transcript?casing=camel&locale=en&lang=en
It seems to use the youtube video id to create this api url, so you can recreate it quite easily:
import json

import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
    # ...
    transcript_url_template = 'https://www.khanacademy.org/api/internal/videos/{}/transcript?locale=en&lang=en'

    def parse(self, response):
        # find the youtube id in the og:video meta tag
        youtube_id = response.xpath("//meta[@property='og:video']/@content").re_first('v/(.+)')
        # create the transcript API url using the youtube id
        url = self.transcript_url_template.format(youtube_id)
        # download the data and parse it
        yield Request(url, self.parse_transcript)

    def parse_transcript(self, response):
        # convert the json body to a python object
        data = json.loads(response.body)
        # parse your data!
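The structure of the returned JSON isn't documented anywhere, so treat this as a sketch only: assuming the endpoint returns a list of caption objects with a 'text' key (an assumption based on inspecting the response in the browser), the final step could look like:

def parse_transcript(self, response):
    data = json.loads(response.body)
    # sketch only: assumes `data` is a list of caption dicts
    # with a 'text' key (unverified assumption)
    yield {'transcript': ' '.join(caption['text'] for caption in data)}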
I am using Ruby for my telegram bot. People send some information to the bot, and sometimes it is a photo. I need to save this photo to my computer, in the directory with the Ruby file, and then send this photo to other people if they need it.
So, how do I download a photo that was sent to my Telegram bot? I know about getFile from the official Telegram site and about this question: How do I download a file or photo that was sent to my Telegram bot? But I don't understand how to use it with Ruby, because I only started Ruby 5 months ago.
I tried to write different code, but none of it works...
bot.messages_handler(content_types=['photo'])
bot.send_message(message.chat_id,bot.get_file_url(message.photo[0].file_id))
I hope for your help.
UPDATE!
So, after coding and analyzing all the information and answers, I now have code which can find the 'file_id' of a photo from a user's message and use this 'file_id' to send the photo to other users.
if message.photo
  photo_info = message.photo.to_s # dump all info about the photo into a string
  file_id = ''
  # collect the characters of 'file_id' from inside the photo info
  for i in ((photo_info.index('@file_id="') + 10)..
            (photo_info.index('", @file_unique_id=') - 1))
    file_id += photo_info[i]
  end
end
bot.api.send_photo(chat_id: message.from.id, photo: file_id) # send a message with only the photo
Maybe it is not as clean as @mechnicov's variant, but it works perfectly and solved my problem. But if somebody can write better code, I will say "Thanks!!!".
You didn't specify which gem you are using.
Just as an example, using telegram-bot-ruby:
You can pass the photo file_id from the message to the getFile method and get file information, which has a file_path attribute:

file_path = bot.api.get_file(file_id: message.photo.last.file_id).file_path

Note that file_path is relative to Telegram's file endpoint, so the full photo URL is https://api.telegram.org/file/bot<token>/<file_path>. After that you can download the file in different ways with Ruby, for example:

require "open-uri"

photo_url = "https://api.telegram.org/file/bot#{token}/#{file_path}"
io = URI.open(photo_url)
File.open(path_to_save, 'wb') { |f| f.write(io.read) }
I'm trying to retrieve a single sheet from a Google spreadsheet in Excel format. I have all the access set up correctly and can run various Google Sheets v4 API functions on it.
I wanted to use the Google::Apis::SheetsV4::SheetsService::copy_spreadsheet function to copy a single sheet as mentioned in the Ruby example here - https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.sheets/copyTo
This is my code -
service = Google::Apis::SheetsV4::SheetsService.new
service.client_options.application_name = APPLICATION_NAME
service.authorization = authorize
spreadsheet_id = "<passing my spreadsheet id here>"
gid = "<setting this as my sheet id from the spreadsheet>"
request = Google::Apis::SheetsV4::CopySheetToAnotherSpreadsheetRequest.new(
  destination_spreadsheet_id: "0",
)
response1 = service.copy_spreadsheet(spreadsheet_id, gid, request)
puts response1.to_json
This always fails with the following error -
/usr/local/lib/ruby/gems/3.1.0/gems/google-apis-core-0.4.2/lib/google/apis/core/http_command.rb:229:in `check_status': badRequest: Invalid destinationSpreadsheetId [0] (Google::Apis::ClientError)
from /usr/local/lib/ruby/gems/3.1.0/gems/google-apis-core-0.4.2/lib/google/apis/core/api_command.rb:134:in `check_status'
It would be great if someone could help me with how to use this properly. Also, if there's a better way to download/export a single sheet from a spreadsheet in Ruby, let me know.
Answer for question 1
This always fails with the following error -
/usr/local/lib/ruby/gems/3.1.0/gems/google-apis-core-0.4.2/lib/google/apis/core/http_command.rb:229:in `check_status': badRequest: Invalid destinationSpreadsheetId [0] (Google::Apis::ClientError)
from /usr/local/lib/ruby/gems/3.1.0/gems/google-apis-core-0.4.2/lib/google/apis/core/api_command.rb:134:in `check_status'
It would be great if someone could help me with how to use this properly.
From your error message and your script, I think that destination_spreadsheet_id: "0" is not correct. In this case, please set the actual destination Spreadsheet ID. When this is reflected in your script, it becomes as follows:
src_spreadsheet_id = "###" # Please set the source Spreadsheet ID.
src_sheet_id = "###" # Please set the sheet ID of the source Spreadsheet.
dst_spreadsheet_id = "###" # Please set the destination Spreadsheet ID.
request = Google::Apis::SheetsV4::CopySheetToAnotherSpreadsheetRequest.new(
  destination_spreadsheet_id: dst_spreadsheet_id,
)
response1 = service.copy_spreadsheet(src_spreadsheet_id, src_sheet_id, request)
puts response1.to_json
Answer for question 2
I'm trying to retrieve a single sheet from a Google spreadsheet in Excel format. I have all the access set up correctly and can run various Google Sheets v4 API functions on it.
Also, if there's a better way to download/export a single sheet from a spreadsheet in Ruby, let me know.
In this case, how about the following sample script? In this script, XLSX data containing the specific sheet is downloaded using the endpoint https://docs.google.com/spreadsheets/d/{spreadsheetId}/export?format=xlsx&gid={sheetId}. So, please set your Spreadsheet ID and sheet ID in the URL. In this case, the access token is retrieved from the service you are already using.
require "open-uri"

url = 'https://docs.google.com/spreadsheets/d/{spreadsheetId}/export?format=xlsx&gid={sheetId}'
filename = 'sample.xlsx' # Please set the saved filename.
access_token = service.request_options.authorization.access_token

URI.open(
  url,
  "Authorization" => "Bearer " + access_token,
  :redirect => true
) do |file|
  File.open(filename, "w+b") do |out|
    out.write(file.read)
  end
end
When this script is run, the specific sheet sheetId of spreadsheetId is downloaded and saved as XLSX data.
This script uses require "open-uri".
Note:
When the Spreadsheet is downloaded as XLSX data, if an error related to the scope occurs, please add https://www.googleapis.com/auth/drive.readonly to the scopes and reauthorize. After this, the script works.
Reference:
Method: spreadsheets.sheets.copyTo
I'm trying to make an app which would iterate through my own posts and get a list of users who favorited each post. Afterwards, I would like the application to follow each of those users if I am not already following them. I am using Ruby for this.
This is my code now:
@client = Twitter::REST::Client.new(config)
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
user = @client.user
tweets = @client.user_timeline(user).take(20)
num_of_tweets = tweets.length
puts "tweets found: #{tweets.length}"
tweets.each do |item|
  puts "#{item}" # iterating through my posts here
end
any suggestions?
That information isn't exposed in the Twitter API, either through a timeline collection or via the endpoint representing a single tweet. This will be why the twitter gem, which provides a usable interface around the REST API, cannot give you what you're after.
Third party sites such as Favstar do display that information, but as far as I know their own API does not expose the relevant users in any manageable way.
I used the first answer to this question and adapted it to my needs: automatically saving pictures from a given URL to my laptop. My problem is how to get the URI of every image that exists on the webpage, so that I can complete my code correctly:
from selenium import webdriver

class TestFirefox:
    def testFirefox(self):
        self.driver = webdriver.Firefox()
        # There are 2 pictures on google.com, I want to download them
        self.driver.get("http://www.google.com")
        self.l = []  # List to store URIs to my images
        self.r = self.driver.find_element_by_tag_name('img')
        # I did print(self.r) but it does not reflect the URI of
        # the image: which is what I want.
        # What can I do to retrieve the URIs and run this:
        self.l.append(self.image_uri)
        for uri_to_img in self.l:
            self.driver.get(uri_to_img)
            # I want to download the images, but I am not sure
            # if this is the good way to proceed since my list's content
            # may not be correct for the moment
            self.driver.save_screenshot(uri_to_img)
        self.driver.close()

if __name__ == '__main__':
    TF = TestFirefox()
    TF.testFirefox()
You need to get the src attribute of each image in order to determine its name and (possibly) its address - remember, src can also be a relative URI. Collect all the image elements and read their src values:

for img in self.driver.find_elements_by_tag_name('img'):
    url = img.get_attribute("src")
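If a src turns out to be relative, one way to make it absolute is to resolve it against the URL of the page the driver is currently on; a minimal sketch, assuming url holds the raw src value from the loop above:

from urllib.parse import urljoin

# resolve a possibly-relative src (e.g. "/images/logo.png")
# against the page the driver is currently viewing
absolute_url = urljoin(self.driver.current_url, url)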
For downloading the image you should try a simple HTTP client like urllib:
import urllib.request
urllib.request.urlretrieve(url, "image.png")
Newbie here, running scrapy on Windows. How do I avoid showing the extracted links and crawled items in the command window? I found comments in the "parse" section on this link: http://doc.scrapy.org/en/latest/topics/commands.html, but I'm not sure if it's relevant and, if so, how to apply it. Here is more detail with part of the code, starting from my second AJAX request (in the first AJAX request, the callback function is "first_json_response"):
def first_json_response(self, response):
    try:
        data = json.loads(response.body)
        meta = {'results': data['results']}
        yield Request(url=url, callback=self.second_json_response,
                      headers={'x-requested-with': 'XMLHttpRequest'}, meta=meta)

def second_json_response(self, response):
    meta = response.meta
    try:
        data2 = json.loads(response.body)
        ...
The "second_json_response" is to retrieve the response from the requested result in first_json_response, as well as to load the new requested data. "meta" and "data" are then both used to define items that need to be crawled. Currently, the meta and links are shown in the windows terminal where I submitted my code. I guess it is taking up some extra time for computer to show them on the screen, and thus want them to disappear. I hope by running scrapy on a kinda-of batch mode will speed up my lengthy crawling process.
Thanks! I really appreciate your comment and suggestion!
From scrapy documentation:
"You can set the log level using the –loglevel/-L command line option, or using the LOG_LEVEL setting."
So append --loglevel=ERROR to your scrapy crawl command. That should make all the info disappear from your command line, but I don't think this will speed things up much.
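Alternatively, per the LOG_LEVEL setting mentioned in the docs quote above, you can set it once in your project settings; a minimal sketch:

# settings.py
LOG_LEVEL = 'ERROR'  # only log errors; hides the scraped-item and link output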
In your pipelines.py file, try using something like:
import json

class JsonWriterPipeline(object):

    def __init__(self):
        # open the output file once when the pipeline starts
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        # write each item as one JSON line
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
This way, when you yield an item from your spider class, it will be written to items.jl.
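Note that a pipeline only runs once it is enabled in settings.py; a minimal sketch, assuming your project module is named myproject (a hypothetical name):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,  # 300 sets the run order
}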
Hope that helps.