Extracting web address and using for loop - for-loop

I am trying to extract websites of the members from https://www.mhi.org/members. So, I wrote a code to visit the member page one by one and extract the web addresses. I am using BeautifulSoup Library to extract.
However my problem is not implementation of BeautifulSoup but in the for loop. The above link has 15 members per page. When I try to run the code for one page it just returns the web address of only one member. I have imported all the desired libraries.
url = input('Enter URL:')
#position = int(input('Enter position:'))-1
html = urlopen(url).read()
lst1 = list()
lst = list()
lst2 = list()
lst3 = list()
lst4 = list()
soup = BeautifulSoup(html,"html.parser")
url1 = "https://www.mhi.org"
conn = sqlite3.connect('list.sqlite')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS weblinks (URL TEXT UNIQUE)''')
cur.execute('''CREATE TABLE IF NOT EXISTS weblinks1 (website TEXT UNIQUE)''')
cur.execute('''CREATE TABLE IF NOT EXISTS weblinks2 (website1 TEXT UNIQUE)''')
#print(soup)
for link in soup.find_all('a'):
lst.append(link.get('href'))
for links in lst:
if 'members' in links:
url2 = urllib.parse.urljoin(url1, links)
lst2.append(url2)
#print(lst2)
url3 = lst2[4:18]
print(url3)
for x in url3:
html1 = urlopen(x).read()
soup1 = BeautifulSoup(html1,"html.parser")
for url4 in soup1.find_all('a'):
url4 = url4.get('href')
if url4 not in lst3:
lst3.append(url4)
url6 = lst3.pop(108)
print(url6)
So, the last "for loop" where url6 is the desired output. It just prints output one web address.
Please advise what am I missing here.

Related

Iterating through Beautiful Soup

I have a list of urls and I am trying to scrape each of them to output in a single data frame. my code is below:
res=[]
for link in url_list:
html = urlopen(link)
soup = BeautifulSoup(html,'html.parser')
headers = [th.getText() for th in soup.findAll('tr', limit=2)[1].findAll('th')]
headers = headers[1:]
rows = soup.findAll('tr')[2:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
for i in range(len(rows))]
stats = pd.DataFrame(player_stats, columns = headers)
I only get output for one web page. Why is that?
You are overwriting your dataframe each time in your for loop. If you have a lot of pages, you can collect the individual dataframes and concatenate them.
df_hold_list = []
for link in url_list:
html = urlopen(link)
soup = BeautifulSoup(html,'html.parser')
headers = [th.getText() for th in soup.findAll('tr', limit=2)[1].findAll('th')]
headers = headers[1:]
rows = soup.findAll('tr')[2:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
for i in range(len(rows))]
df_hold_list.append(pd.DataFrame(player_stats, columns = headers))
stats = pd.concat(df_hold_list)

xpath could not recognize predicate for a tag

I try to use scrapy xpath to scrape a page, but it seems it cannot capture the tag with predicates when I use a for loop,
# This package will contain the spiders of your Scrapy project
from cunyfirst.items import CunyfirstSectionItem
import scrapy
import json
class CunyfristsectionSpider(scrapy.Spider):
name = "cunyfirst-section-spider"
start_urls = ["file:///Users/haowang/Desktop/section.htm"]
def parse(self, response):
url = response.url
yield scrapy.Request(url, self.parse_page)
def parse_page(self, response):
n = -1
for section in response.xpath("//a[contains(#name,'MTG_CLASS_NBR')]"):
print(response.xpath("//a[#name ='MTG_CLASSNAME$10']/text()"))
n += 1
class_num = section.xpath('text()').extract_first()
# print(class_num)
classname = "MTG_CLASSNAME$" + str(n)
date = "MTG_DAYTIME$" + str(n)
instr = "MTG_INSTR$" + str(n)
print(classname)
class_name = response.xpath("//a[#name = classname]/text()")
I am looking for a tags with name as "MTG_CLASSNAME$" + str(n), with n being 0,1,2..., and I am getting empty output from my xpath query. Not sure why...
PS.
I am basically trying to scrape course and their info from https://hrsa.cunyfirst.cuny.edu/psc/cnyhcprd/GUEST/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?FolderPath=PORTAL_ROOT_OBJECT.HC_CLASS_SEARCH_GBL&IsFolder=false&IgnoreParamTempl=FolderPath%252cIsFolder&PortalActualURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentProvider=HRMS&PortalCRefLabel=Class%20Search&PortalRegistryName=GUEST&PortalServletURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsp%2fcnyepprd%2f&PortalURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsc%2fcnyepprd%2f&PortalHostNode=ENTP&NoCrumbs=yes
with filter applied: Kingsborough CC, fall 18, BIO
Thanks!
Well... I've visited the website you put in the question description, I used element inspection and searched for "MTG_CLASSNAME" and I got 0 matches...
So I will give you some tools:
In your settings.py set that:
LOG_FILE = "log.txt"
LOG_STDOUT=True
then print the response body ( response.body ) where you should ( in the top of parse_page function in this case ) and search it in log.txt
Check there if there is what you are looking for.
If there is, use this https://www.freeformatter.com/xpath-tester.html (
or similar ) to check your xpath statement.
In addition, change for section in response.xpath("//a[contains(#name,'MTG_CLASS_NBR')]"):
by for section in response.xpath("//a[contains(#name,'MTG_CLASS_NBR')]").extract():, this will raise an error when you get the data that you are looking for.

For loop while using scrapy

I am trying to crawl a website. I want to do it for different dates. So i am storing date in a list. But while trying to access items of list, crawler works only for 1st value in list. Please help. following is my code:
class SpidyQuotesViewStateSpider(scrapy.Spider):
name = 'retail_price'
def start_requests(self):
print "start request"
urls = "http://fcainfoweb.nic.in/PMSver2/Reports/Report_Menu_web.aspx"
yield scrapy.Request(url=urls, callback=self.parse)
def parse(self, response):
dated = ["05/03/2017","04/03/2017"]
urls = "http://fcainfoweb.nic.in/PMSver2/Reports/Report_Menu_web.aspx"
#frmdata =
cookies1 ={}
val = response.headers.getlist('Set-Cookie')
print "login session values" ,response.headers.getlist('Set-Cookie')
if(len(val) != 0):
cookies1['ASP.NET_SessionId'] = str(response.headers.getlist('Set-Cookie')[0].split(";")[0].split("=")[1])
cookies1['path'] = str(response.headers.getlist('Set-Cookie')[0].split(";")[1].split("=")[1])
print cookies1;
for i in range(len(dated)):
yield scrapy.FormRequest(url=urls, callback=self.parse1, formdata={'ctl00$MainContent$btn_getdata1':"Get Data",
'ctl00$MainContent$Txt_FrmDate':dated[i],
'ctl00$MainContent$Ddl_Rpt_Option0':"Daily Prices",
'ctl00$MainContent$Rbl_Rpt_type':"Price report",
'ctl00$MainContent$ddl_Language':"English",
'ctl00$MainContent$Ddl_Rpt_type':"Retail",
'__EVENTVALIDATION':"IwZyKgfTXVzxiHxiPXGk/W8XQZBDb0EOPxJh6s8hofq0ffqOpiHSH77CafcxySF3PbkYgSMNFCJhLM2cGnL6SxT0PJuGDCJtV0V8Y4a94UErUCiSANiin+4uKckk9v9Ux8JqTVeaipppmlH+wyks2U9SgPfkNUsqw4eHCkDyB5akNNZImRIixOHHVY3JSXGkwXn7ueK9w+AgnqJzpXaWdMr9J1++M4VAFImSNF8brFSfPHe5kb/qzkGIwUr/KRouaRYK8WLWZh/Mbl9xwREwhDSxWJSOdihSE0WWoaqSMtpaR99rDDCsD3mdJqfu0aPIlREupTZRzlrmztXU0eS3949YW+ywdTRvykaMNgOW2Q4saYP5j/niKbRW6GiDnaLV2A38X/HW80+trrsjwJr9tjTKVFyikf6s/3gzyiTp11ivSkwIY2b3hutjYn7OfTDo",
#'__EVENTVALIDATION':"HqVo2xHk04clYwnBposXbZGhbIr181A7RbyeZv74Cia7rXSKmpOpbeSnn3XXnoDJKRxMK0W9nxKZFfkNje+P/K7gE5HVjHJr9Gr0Gs46TntzKDsvzyii8jZ7e0fdZgQCJKoXxQNgR2vNkWqChKcEldBuMHCOgJRqCNCF/JPFKpdKZoIWr7GU8rhzwLijf/Gkm+FuTULs/fl2HHK6Z1QQEozzEHFsDwzl0G4IiN//eNYfHuUBXKZ3wdZzPqG0s53WHEuSBzhqBC9AtCJOs4ZZhdtwFh8iyTJ4PlsLP9DLHYHRCOAd72UO0UH8gT7gAkKVo1I4L540DilowOR9SttH7MM/oOs9qhKlnG61FgqkYGW8zGzF/yNEXO+beVAK1RVvuO+FDnuq/g36TRnUieei5GpAZ+96CSoCIxykdvHx8R+smTNF/5erlowV4ci+tcI7",
'__VIEWSTATEENCRYPTED':"",
'__VIEWSTATEGENERATOR':"85862B00",
'__VIEWSTATE':"+a+3jrBEKxDdkPOzx2wXwKaTMWvCB60WPaHRfJUAZQrdFIpxSqFr5VseTclpGzeHXdxaFnxJe/PkxDKYa7sj3Wiv/os1bNeX0IEB3s45eFsHYWGiU8cvsXCGa5z7rrGRDL5hotg7k/MuUWj8w27xXZO423MN5OsHS+wh+tC/5/Xix+w3zxuQhi8jR5DnreimHbhGZn1sYaKYIGCc8mDIDRNl+w1OZ058F+3LAx96QUu5BYiMYOmrlyxrb9b2yPTmmIrI4NtC4ClBQlxuST5wMDP3vUqqWMhn4auk8ev5gHyPestCRrsAXWs07wDNnikemMwo/4wPiTEbnZQV6SLcDUw0gZpXjXwLI7mhsVjEyVNaQnJp6+Wi6FLsAEEMlFYmQut3JecpVIUkjF9uYSN2GLIbXHPs37AiEXPeQ8E/GyBMx3z1X5l8sw/xSNmFgYQC3riajn8V0+SdkuV2PbNbYKtc+uoSCNLppLYCqiOv5eWanGvAQro2Q67FBA4w2xY+V/K8mzHaGMLoDBxJxLslWyJpL5cX0C6qoXVUu8B028auAQM4eVzH1YPF5qrJiCDo",
#'__VIEWSTATE':"W+m8kNAS6QHiRPo+zFj00EDs/Dbq+y/XvtCmSNwOIkGKlikAlphT8HBAWQDskSm1vdNterBuo0Hy7m4xPbXMOnyEm6IlseXO3jPw+ofnI2WHAKknLil+GeS0IfMWGeoD5aNyiz3zh1jZkKU7R7hQsxwARoHRyjhf8UCooFbkVvL6ddHVYZbH5LcocmCF1BTOCqYN5y5yzfDfYbp3KNW9kH53pdmwCsjiEirdxxUGDoG1Ke3JBEXfSl+4XubirHSR8z+VlFmPPXZGU8mMogwq9Eg822RYjvbwvZG74djcf7kdfB9KXCPO9u6cWIjLiW+cfXHSXD+1XYFVf9ATU2/NV4YbUzsI4PJRwoGD4BryUNIm2JFeT4c8F4REYTA16shxz5mDTFQ6rbmg6SmqP8G9gAc2Hr9ABD8+2BUNabGhNZ8wDIZArfYS4pl5DNrlPlpqeCjhmvv0znKAJSOac3pCUej8G90ZGwQKOPORWbNVzQShoH7QvrXV8pCklcia6psuAGO+Oj72oDWPxedE4DjdjX5TbLoW4bzsk/YNfUv4JpjGR8DWpG8IFYJG9CCjMEYb",
'__LASTFOCUS':"",
'__EVENTARGUMENT':"",
'__EVENTTARGET':"",
'ctl00_MainContent_ToolkitScriptManager1_HiddenField':";;AjaxControlToolkit,+Version=4.1.51116.0,+Culture=neutral,+PublicKeyToken=28f01b0e84b6d53e:en-US:fd384f95-1b49-47cf-9b47-2fa2a921a36a:475a4ef5:addc6819:5546a2b:d2e10b12:effe2a26:37e2e5c9:5a682656:c7029a2:e9e598a9"},method='POST',cookies = cookies1)
def parse1(self, response):
path1 = "id('Panel1')"
value1 = response.xpath(path1).extract_first()
print value1
First of all, you are sending the spider more time on the same site, though with different form parameters. You have therefore to use dont_filter=True in the request, otherwise Scrapy blocks duplicate calls.
Then it seems to me that the site you are scraping don't allow you to make more than one request during the same session. Try for example to go to http://fcainfoweb.nic.in/PMSver2/Reports/Report_Menu_web.aspx with your browser, compile the form, get the data and than to go back to the initial page: It's impossible. So you have to modify your spider. Here's a very rough code just to give an idea. It works for me, but please don't use it in production!
class SpidyQuotesViewStateSpider(scrapy.Spider):
name = 'retail_price'
urls = "http://fcainfoweb.nic.in/PMSver2/Reports/Report_Menu_web.aspx"
def start_requests(self):
dated = ["01/03/2017","05/03/2017","04/03/2017"]
for i in dated:
request = scrapy.Request(url=self.urls, dont_filter=True, callback=self.parse)
request.meta['question'] = i
yield request
def parse(self, response):
thedate = response.meta['question']
cookies1 ={}
val = response.headers.getlist('Set-Cookie')
print("login session values" ,response.headers.getlist('Set-Cookie'))
if(len(val) != 0):
cookies1['ASP.NET_SessionId'] = str(str(response.headers.getlist('Set-Cookie')[0]).split(";")[0].split("=")[1])
cookies1['path'] = str(str(response.headers.getlist('Set-Cookie')[0]).split(";")[1].split("=")[1])
yield scrapy.FormRequest(url=self.urls, dont_filter=True, callback=self.parse1, formdata={'ctl00$MainContent$btn_getdata1':"Get Data",
'ctl00$MainContent$Txt_FrmDate': thedate,
'ctl00$MainContent$Ddl_Rpt_Option0':"Daily Prices",
'ctl00$MainContent$Rbl_Rpt_type':"Price report",
'ctl00$MainContent$ddl_Language':"English",
'ctl00$MainContent$Ddl_Rpt_type':"Retail",
'__EVENTVALIDATION':"IwZyKgfTXVzxiHxiPXGk/W8XQZBDb0EOPxJh6s8hofq0ffqOpiHSH77CafcxySF3PbkYgSMNFCJhLM2cGnL6SxT0PJuGDCJtV0V8Y4a94UErUCiSANiin+4uKckk9v9Ux8JqTVeaipppmlH+wyks2U9SgPfkNUsqw4eHCkDyB5akNNZImRIixOHHVY3JSXGkwXn7ueK9w+AgnqJzpXaWdMr9J1++M4VAFImSNF8brFSfPHe5kb/qzkGIwUr/KRouaRYK8WLWZh/Mbl9xwREwhDSxWJSOdihSE0WWoaqSMtpaR99rDDCsD3mdJqfu0aPIlREupTZRzlrmztXU0eS3949YW+ywdTRvykaMNgOW2Q4saYP5j/niKbRW6GiDnaLV2A38X/HW80+trrsjwJr9tjTKVFyikf6s/3gzyiTp11ivSkwIY2b3hutjYn7OfTDo",
#'__EVENTVALIDATION':"HqVo2xHk04clYwnBposXbZGhbIr181A7RbyeZv74Cia7rXSKmpOpbeSnn3XXnoDJKRxMK0W9nxKZFfkNje+P/K7gE5HVjHJr9Gr0Gs46TntzKDsvzyii8jZ7e0fdZgQCJKoXxQNgR2vNkWqChKcEldBuMHCOgJRqCNCF/JPFKpdKZoIWr7GU8rhzwLijf/Gkm+FuTULs/fl2HHK6Z1QQEozzEHFsDwzl0G4IiN//eNYfHuUBXKZ3wdZzPqG0s53WHEuSBzhqBC9AtCJOs4ZZhdtwFh8iyTJ4PlsLP9DLHYHRCOAd72UO0UH8gT7gAkKVo1I4L540DilowOR9SttH7MM/oOs9qhKlnG61FgqkYGW8zGzF/yNEXO+beVAK1RVvuO+FDnuq/g36TRnUieei5GpAZ+96CSoCIxykdvHx8R+smTNF/5erlowV4ci+tcI7",
'__VIEWSTATEENCRYPTED':"",
'__VIEWSTATEGENERATOR':"85862B00",
'__VIEWSTATE':"+a+3jrBEKxDdkPOzx2wXwKaTMWvCB60WPaHRfJUAZQrdFIpxSqFr5VseTclpGzeHXdxaFnxJe/PkxDKYa7sj3Wiv/os1bNeX0IEB3s45eFsHYWGiU8cvsXCGa5z7rrGRDL5hotg7k/MuUWj8w27xXZO423MN5OsHS+wh+tC/5/Xix+w3zxuQhi8jR5DnreimHbhGZn1sYaKYIGCc8mDIDRNl+w1OZ058F+3LAx96QUu5BYiMYOmrlyxrb9b2yPTmmIrI4NtC4ClBQlxuST5wMDP3vUqqWMhn4auk8ev5gHyPestCRrsAXWs07wDNnikemMwo/4wPiTEbnZQV6SLcDUw0gZpXjXwLI7mhsVjEyVNaQnJp6+Wi6FLsAEEMlFYmQut3JecpVIUkjF9uYSN2GLIbXHPs37AiEXPeQ8E/GyBMx3z1X5l8sw/xSNmFgYQC3riajn8V0+SdkuV2PbNbYKtc+uoSCNLppLYCqiOv5eWanGvAQro2Q67FBA4w2xY+V/K8mzHaGMLoDBxJxLslWyJpL5cX0C6qoXVUu8B028auAQM4eVzH1YPF5qrJiCDo",
#'__VIEWSTATE':"W+m8kNAS6QHiRPo+zFj00EDs/Dbq+y/XvtCmSNwOIkGKlikAlphT8HBAWQDskSm1vdNterBuo0Hy7m4xPbXMOnyEm6IlseXO3jPw+ofnI2WHAKknLil+GeS0IfMWGeoD5aNyiz3zh1jZkKU7R7hQsxwARoHRyjhf8UCooFbkVvL6ddHVYZbH5LcocmCF1BTOCqYN5y5yzfDfYbp3KNW9kH53pdmwCsjiEirdxxUGDoG1Ke3JBEXfSl+4XubirHSR8z+VlFmPPXZGU8mMogwq9Eg822RYjvbwvZG74djcf7kdfB9KXCPO9u6cWIjLiW+cfXHSXD+1XYFVf9ATU2/NV4YbUzsI4PJRwoGD4BryUNIm2JFeT4c8F4REYTA16shxz5mDTFQ6rbmg6SmqP8G9gAc2Hr9ABD8+2BUNabGhNZ8wDIZArfYS4pl5DNrlPlpqeCjhmvv0znKAJSOac3pCUej8G90ZGwQKOPORWbNVzQShoH7QvrXV8pCklcia6psuAGO+Oj72oDWPxedE4DjdjX5TbLoW4bzsk/YNfUv4JpjGR8DWpG8IFYJG9CCjMEYb",
'__LASTFOCUS':"",
'__EVENTARGUMENT':"",
'__EVENTTARGET':"",
'ctl00_MainContent_ToolkitScriptManager1_HiddenField':";;AjaxControlToolkit,+Version=4.1.51116.0,+Culture=neutral,+PublicKeyToken=28f01b0e84b6d53e:en-US:fd384f95-1b49-47cf-9b47-2fa2a921a36a:475a4ef5:addc6819:5546a2b:d2e10b12:effe2a26:37e2e5c9:5a682656:c7029a2:e9e598a9"},method='POST',cookies = cookies1)
def parse1(self, response):
path1 = "id('Panel1')"
value1 = response.xpath(path1).extract_first()[:574]
print(value1)

Defining URL list for crawler, syntax issues

I'm currently running the following code:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
def hltv_match_list(max_offset):
offset = 0
while offset < max_offset:
url = 'http://www.hltv.org/?pageid=188&offset=' + str(offset)
base = "http://www.hltv.org/"
soup = BeautifulSoup(requests.get("http://www.hltv.org/?pageid=188&offset=0").content, 'html.parser')
cont = soup.select("div.covMainBoxContent a[href*=matchid=]")
href = urljoin(base, (a["href"] for a in cont))
# print([urljoin(base, a["href"]) for a in cont])
get_hltv_match_data(href)
offset += 50
def get_hltv_match_data(matchid_url):
source_code = requests.get(matchid_url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
for teamid in soup.findAll("div.covSmallHeadline a[href*=teamid=]"):
print teamid.string
hltv_match_list(5)
Errors:
File "C:/Users/mdupo/PycharmProjects/HLTVCrawler/Crawler.py", line 12, in hltv_match_list
href = urljoin(base, (a["href"] for a in cont))
File "C:\Python27\lib\urlparse.py", line 261, in urljoin
urlparse(url, bscheme, allow_fragments)
File "C:\Python27\lib\urlparse.py", line 143, in urlparse
tuple = urlsplit(url, scheme, allow_fragments)
File "C:\Python27\lib\urlparse.py", line 182, in urlsplit
i = url.find(':')
AttributeError: 'generator' object has no attribute 'find'
Process finished with exit code 1
I think I'm having trouble with the href = urljoin(base, (a["href"] for a in cont)) part as I'm trying to create a url list I can feed into get_hltv_match_datato then capture various items within that page. Am I going about this wrong?
Cheers
You need to join each href as per your commented code:
urls = [urljoin(base,a["href"]) for a in cont]
You are trying to join the base url to a generator i.e (a["href"] for a in cont) which makes no sense.
You should also be passing the url to requests or you are going to be requesting the same page over and over.
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

Using scrapy to download google images from multiple urls

I am trying to download images from multiple urls from a search in google images.
However, i only want 15 images from each url.
class imageSpider(BaseSpider):
name = "image"
start_urls = [
'https://google.com/search?q=simpsons&tbm=isch'
'https://google.com/search?q=futurama&tbm=isch'
]
def parse(self,response):
hxs = HtmlXPathSelector(response)
items = []
images = hxs.select("//div[#id='ires']//div//a[#href]")
count = 0
for image in images:
count += 1
item = ImageItem()
image_url = image.select(".//img[#src]")[0].extract()
import urlparse
image_absolute_url = urlparse.urljoin(response.url, image_url.strip())
index = image_absolute_url.index("src")
changedUrl = image_absolute_url[index+5:len(image_absolute_url)-2]
item['image_urls'] = [changedUrl]
index1 = site['url'].index("search?q=")
index2 = site['url'].index("&tbm=isch")
imageName = site['url'][index1+9:index2]
download(changedUrl,imageName + str(count)+".png")
items.append(item)
if count == 15:
break
return items
The download function downloads the images (i have code for that. that's not the problem).
The problem is that when i break, it stops at the first url and never continues on to the next url. How could i make it download 15 images for the first url and then 15 images for the 2nd url. I am using break because there are about 1000 images in every google images page and i don't want that many.
The problem is not about break statement. you have missed a comma in start_urls.
it should be like this:
start_urls = [
'http://google.com/search?q=simpsons&tbm=isch',
'http://google.com/search?q=futurama&tbm=isch'
]

Resources