How to check if Instagram image has been removed - ruby

I've been trying to detect whether an Instagram image has been removed, so that I can hide those photos in my DB.
Right now I'm storing Instagram image shortcodes and accessing the images at "https://instagram.com/p/#{shortcode}".
If you open a legitimate URL in Ruby, for example:
open "https://instagram.com/p/1"
it returns 200 OK, whereas a random non-existent page raises a 404 exception.
But sometimes it seems to raise a 404 on a legitimate page, so my code hides images that it shouldn't.
begin
  link = "https://instagram.com/p/#{submission.image}"
  submission.visible = true
  open link, :allow_redirections => :all
rescue OpenURI::HTTPError => e
  if e.message.include? '404 NOT FOUND'
    submission.visible = false
  end
end
Do you have any ideas?

You might need to use the API. Check whether the media endpoints of Instagram's API help:
https://instagram.com/developer/endpoints/media/#get_media
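If you stay with open-uri for now, one thing worth checking is that the rescue matches the exact string '404 NOT FOUND'. The message open-uri raises is built from the status line the server sends, so its wording and casing can vary. Below is a minimal sketch, assuming it is enough to look at the numeric status code and to leave the record untouched on any other error. The names instagram_visible? and opener: are hypothetical, and the :allow_redirections option from the question (which comes from a separate gem) is left out.

```ruby
require 'open-uri'

# Hypothetical helper, not part of open-uri.
# Returns true when the page opens, false on a definite 404,
# and nil when the outcome is unknown (any other HTTP error).
# `opener` is injectable so the logic can be exercised without the network.
def instagram_visible?(shortcode, opener: ->(url) { URI.open(url) })
  opener.call("https://instagram.com/p/#{shortcode}")
  true
rescue OpenURI::HTTPError => e
  # e.io.status is [code, reason], e.g. ["404", "Not Found"]
  e.io.status.first == '404' ? false : nil
end

# Possible usage, keeping the previous value when the result is unknown:
#   visible = instagram_visible?(submission.image)
#   submission.visible = visible unless visible.nil?
```

This does not explain why a legitimate page would answer with 404 in the first place; for that, the API is probably the more reliable route.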

Related

Why does Selenium sometimes find an XPath and sometimes fail to identify it?

def linkdin_login(company_name,username,password):
    driver.get('https://linkedin.com/')
    driver.find_element(By.XPATH,'//*[@id="session_key"]').send_keys(username)
    driver.find_element(By.XPATH,'//*[@id="session_password"]').send_keys(password)
    driver.find_element(By.XPATH,"//button[@class='sign-in-form__submit-button']").click()
    #def company_info(company_name):
    element = driver.find_element(By.CSS_SELECTOR,"#global-nav-typeahead > input")
    element.send_keys(company_name)
    element.send_keys(Keys.ENTER)
    driver.implicitly_wait(10) # seconds
    driver.get(driver.find_element(By.CSS_SELECTOR,".search-nec__hero-kcard-v2 > a:nth-child(1)").get_attribute("href"))
    driver.implicitly_wait(10)
    people()
With the code above I log into LinkedIn and open the LinkedIn page of a company. Once on that page, I try to get the employee data using the people function shown below.
def people():
    driver.implicitly_wait(10)
    driver.get(driver.find_element(By.XPATH,"/html/body/div[5]/div[3]/div/div[2]/div/div[2]/main/div[1]/section/div/div[2]/div[1]/div[2]/div/a").get_attribute("href"))
    driver.implicitly_wait(10)
    people = driver.find_element(By.XPATH,"/html/body/div[4]/div[3]/div[2]/div/div[1]/main/div/div/div[2]/div/ul")
    people_data = people.find_elements(By.TAG_NAME,"li")
    for i in people_data:
        print(i.text)
In this function I am trying to access the link to the employees' data, and that is where the problem lies.
On line 2 of the people function I try to get the link. For some reason I sometimes get the link (not too frequently!), but most of the time I get an error saying the XPath was not found.
I didn't know how to attach an HTML page, so I am attaching the link:
https://www.linkedin.com/company/google/
1. I tried an implicit wait, assuming that the program was trying to access the XPath while the page was still loading.

How to check that a PDF file has some link with Ruby/Rspec?

I am using prawnpdf/pdf-inspector to test that the content of a PDF generated in my Rails app is correct.
I want to check that the PDF file contains a link with a certain URL. I looked at yob/pdf-reader but haven't found any useful information related to this topic.
Is it possible to test URLs within a PDF with Ruby/RSpec?
I would like the following:
expect(urls_in_pdf(pdf)).to include 'https://example.com/users/1'
The pdf-reader gem (https://github.com/yob/pdf-reader) provides a text method for each page.
Do something like:
pdf = PDF::Reader.new("tmp/pdf.pdf")
assert pdf.pages[0].text.include? 'https://example.com/users/1'
assuming what you are looking for is on the first page.
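To get closer to the urls_in_pdf matcher from the question, the same idea can be applied to every page. This is only a sketch, assuming the URL appears in the page's visible text; a link stored purely as an annotation behind other text would not be found this way. URL_PATTERN is a deliberately simple, hypothetical regex, and the helper accepts any object that responds to pages, where each page responds to text.

```ruby
# Simple, hypothetical pattern: http(s) URLs up to whitespace or a closing delimiter.
URL_PATTERN = %r{https?://[^\s<>"')]+}

# Collects the distinct URLs found in the extracted text of all pages.
# `reader` is expected to behave like PDF::Reader: #pages, each with #text.
def urls_in_pdf(reader)
  reader.pages.flat_map { |page| page.text.scan(URL_PATTERN) }.uniq
end

# With the pdf-reader gem installed, usage might look like:
#   require 'pdf-reader'
#   pdf = PDF::Reader.new("tmp/pdf.pdf")
#   expect(urls_in_pdf(pdf)).to include 'https://example.com/users/1'
```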
Since pdf-inspector seems to return only text, you could try to use pdf-reader directly (pdf-inspector uses it anyway).
reader = PDF::Reader.new("somefile.pdf")
reader.pages.each do |page|
  puts page.raw_content # This should also give you the link
end
I only had a quick look at the GitHub page, so I am not sure exactly what raw_content returns. But there is also a low-level method to directly access the objects of the PDF:
reader = PDF::Reader.new("somefile.pdf")
puts reader.objects.inspect
With that it surely is possible to get the url.

Logging Into Google To Scrape A Private Google Group (over HTTPS)

I'm trying to log into Google, so that I can scrape & migrate a private google group.
It doesn't seem to log in over SSL. Any ideas appreciated. I'm using Mechanize and the code is below:
group_signin_url = "https://login page to goolge, with referrer url to a private group here"
user = ENV['GOOGLE_USER']
password = ENV['GOOGLE_PASSWORD']
scraper = Mechanize.new
scraper.user_agent = Mechanize::AGENT_ALIASES["Linux Firefox"]
scraper.agent.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
page = scraper.get group_signin_url
google_form = page.form
google_form.Email = user
google_form.Passwd = password
group_page = scraper.submit(google_form, google_form.buttons.first)
pp group_page
I worked with Ian (the OP) on this problem and just felt we should close this thread with some answers based on what we found when we spent some more time on the problem.
1) You can't scrape a Google Group with Mechanize. We managed to get logged in, but the content of the Google Group pages is all rendered in-browser, meaning that HTTP requests, such as those issued by Mechanize, are returned with a few links and no actual content.
We found that we could get page content by the use of Selenium (we used Selenium in Firefox, using the Ruby bindings).
2) The HTML element IDs/classes in Google Groups are obfuscated, but we found that these Selenium commands will pull out the bits you need (until Google changes them):
message snippets (click on them to expand messages):
find_elements(:class, 'GFP-UI5CCLB')
elements with name of author:
find_elements(:class, 'GFP-UI5CA1B')
elements with content of post:
find_elements(:class, 'GFP-UI5CCKB')
elements containing date:
find_elements(:class, 'GFP-UI5CDKB') (and then use the attribute[:title] for a full-length date string)
3) I have some Ruby code here which scrapes the content programmatically and uploads it into a Discourse forum (which is what we were trying to migrate to).
It's hacky but it kind of works. I recently migrated 2 commercially important Google Groups using this script. I'm up for taking on 'We Scrape Your Google Group' type work, please PM me.

building custom url validator - how to grab new url out of a 'moved' url response?

So I do this:
require 'net/http';
require 'net/smtp';
res = Net::HTTP.get_response(URI.parse("http://www.cifs.dk"));
and res.response.msg tells me 302 - the site has moved.
How do I get the full address that it was moved to (http://www.cifs.dk/en)?
res.methods shows me a bunch of things to try, but no luck yet.
The closest I've found is
res.response.body, but that just gives me
... </h1>This object may be found here ...
which would be no fun at all to try to piece together.
The Location header is what you are looking for:
response['Location']
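Note that the Location header may hold a relative path rather than a full URL, so it can help to resolve it against the address you requested. A minimal sketch, assuming only Ruby's standard library; resolve_redirect is a hypothetical helper name, not part of Net::HTTP:

```ruby
require 'net/http'
require 'uri'

# Returns the absolute redirect target for a 3xx response that carries a
# Location header, or nil for any other response.
def resolve_redirect(request_url, response)
  return nil unless response.is_a?(Net::HTTPRedirection)

  location = response['Location']
  return nil if location.nil?

  # URI.join leaves absolute locations as they are and resolves relative ones.
  URI.join(request_url, location).to_s
end

# Usage against the live site (performs a real HTTP request):
#   url = "http://www.cifs.dk"
#   res = Net::HTTP.get_response(URI.parse(url))
#   puts resolve_redirect(url, res)
```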

How to update address of a facebook page through API

I am trying to update the address/location of a Facebook business page through the API using the koala Ruby gem, but so far I have no working solution.
page_access_token = "gw4t3434"
page_api = Koala::Facebook::API.new(page_access_token)
page_api.graph_call('me', {:location => {:street => "my street"}}, 'post') #error. Koala::Facebook::APIError: OAuthException: (#100) Parameters do not match any fields that can be updated
page_api.graph_call('me', {:location => {:address => "my street"}}, 'post') #error. Koala::Facebook::APIError: OAuthException: (#100) Parameters do not match any fields that can be updated
page_api.graph_call('me', {:address => "my street"}, 'post') # does not raise an error, but does not work
page_api.graph_call('me', {:street => "my street"}, 'post') # does not raise an error, but does not work
I cannot find a clear explanation in the Facebook API reference regarding updating the address of a page either. I may be missing something...
You can't write to the location object, only read. See "Updating Page Attributes" in the API. Also, there is no permission to request for writing to a location object.
An alternative is that you write to the Page's about section - this is allowed. Perhaps you can place an address reference here to meet the requirement of making address changes visible to the end user.
