Extracting data from multiple tables with Scrapy using XPath

I'm extracting metadata and URLs from 12 tables on a web page, and while I've got it working, I'm pretty new to both XPath and Scrapy, so is there a more concise way I could have done this?
I was initially getting loads of duplicates as I tried a variety of XPaths, and I realised each table row was being repeated for each table. My solution was to enumerate the tables and loop through each one, grabbing only that table's rows. It feels like there is probably a simpler way to do it, but I'm not sure what.
import scrapy


class LinkCheckerSpider(scrapy.Spider):
    name = 'foodstandardsagency'
    allowed_domains = ['ratings.food.gov.uk']
    start_urls = ['https://ratings.food.gov.uk/open-data/en-gb/']

    def parse(self, response):
        print(response.url)
        tables = response.xpath('//*[@id="openDataStatic"]//table')
        num_tables = len(tables)
        # XPath positions are 1-based, so count from 1 to num_tables
        for tabno in range(1, num_tables + 1):
            search_path = '//*[@id="openDataStatic"]/table[%d]/tr' % tabno
            rows = response.xpath(search_path)
            for row in rows:
                local_authority = row.xpath('td[1]//text()').extract()
                last_update = row.xpath('td[2]//text()').extract()
                num_businesses = row.xpath('td[3]//text()').extract()
                xml_file_descr = row.xpath('td[4]//text()').extract()
                xml_file = row.xpath('td[4]/a/@href').extract()
                yield {'local_authority': local_authority[1],
                       'last_update': last_update[1],
                       'num_businesses': num_businesses[1],
                       'xml_file': xml_file[0],
                       'xml_file_descr': xml_file_descr[1]
                       }
And I'm running it with
scrapy runspider fsa_xpath.py

You can iterate through the table selectors returned by your first XPath:
tables = response.xpath('//*[@id="openDataStatic"]//table')
for table in tables:
    for row in table.xpath('./tr'):
        local_authority = row.xpath('td[1]//text()').extract()
You did this with the rows.
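Put together, a minimal sketch of the whole parse method using that relative iteration (it keeps the question's field names and its [1] text-node indexing, which assumes the same row structure as the original page):

def parse(self, response):
    # Iterate the table selectors directly; no need to count tables
    # and rebuild an absolute XPath for each index.
    for table in response.xpath('//*[@id="openDataStatic"]//table'):
        for row in table.xpath('./tr'):
            yield {
                'local_authority': row.xpath('td[1]//text()').extract()[1],
                'last_update': row.xpath('td[2]//text()').extract()[1],
                'num_businesses': row.xpath('td[3]//text()').extract()[1],
                'xml_file': row.xpath('td[4]/a/@href').extract_first(),
                'xml_file_descr': row.xpath('td[4]//text()').extract()[1],
            }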

Related

Scrapy unable to scrape items, xpath not working

I've spent a lot of time trying to scrape information with Scrapy, without success.
My goal is to crawl through the categories and, for each item, scrape the title, price and the title's href link.
The problem seems to come from the parse_items function. I've checked the XPaths with FirePath and I can select the items as wanted, so maybe I just don't understand how XPaths are processed by Scrapy...
Here is my code
Here is my code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

from ..items import electronic_Item


class robot_makerSpider(CrawlSpider):
    name = "robot_makerSpider"
    allowed_domains = ["robot-maker.com"]
    start_urls = [
        "http://www.robot-maker.com/shop/",
    ]

    rules = (
        Rule(
            LinkExtractor(
                allow=(
                    "http://www.robot-maker.com/shop/12-kits-robots",
                    "http://www.robot-maker.com/shop/36-kits-debutants-arduino",
                    "http://www.robot-maker.com/shop/13-cartes-programmables",
                    "http://www.robot-maker.com/shop/14-shields",
                    "http://www.robot-maker.com/shop/15-capteurs",
                    "http://www.robot-maker.com/shop/16-moteurs-et-actionneurs",
                    "http://www.robot-maker.com/shop/17-drivers-d-actionneurs",
                    "http://www.robot-maker.com/shop/18-composants",
                    "http://www.robot-maker.com/shop/20-alimentation",
                    "http://www.robot-maker.com/shop/21-impression-3d",
                    "http://www.robot-maker.com/shop/27-outillage",
                ),
            ),
            callback='parse_items',
        ),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        products = hxs.xpath("//div[@id='center_column']/ul/li")
        items = []
        for product in products:
            item = electronic_Item()
            item['title'] = product.xpath(
                "li[1]/div/div/div[2]/h2/a/text()").extract()
            item['price'] = product.xpath(
                "div/div/div[3]/div/div[1]/span[1]/text()").extract()
            item['url'] = product.xpath(
                "li[1]/div/div/div[2]/h2/a/@href").extract()
            # check that all fields exist
            if item['title'] and item['price'] and item['url']:
                items.append(item)
        return items
Thanks for your help.
The XPaths in your spider are indeed faulty.
Your first XPath for products does work, but it's not explicit enough and might fail really easily, while the product detail XPaths are not working at all.
I've got it working with:
products = response.xpath("//div[@class='product-container']")
items = []
for product in products:
    item = dict()
    item['title'] = product.xpath('.//h2/a/text()').extract_first('').strip()
    item['url'] = product.xpath('.//h2/a/@href').extract_first()
    item['price'] = product.xpath(".//span[contains(@class,'product-price')]/text()").extract_first('').strip()
Most modern websites have very parsing-friendly HTML sources (since they need to parse them themselves for their fancy CSS styles and JavaScript functions).
So generally you should look at the class and id names of the nodes you want to extract using your browser's inspect tools (right click -> inspect element) instead of using some automated selection tool. It's more reliable and doesn't take much more work once you get the hang of it.
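For completeness, a minimal sketch of how those attribute-based selectors could slot back into the spider's callback (assuming the electronic_Item fields from the question; yielding items one by one replaces the items list):

def parse_items(self, response):
    # Anchor on a stable class name instead of positional divs.
    for product in response.xpath("//div[@class='product-container']"):
        item = electronic_Item()
        item['title'] = product.xpath('.//h2/a/text()').extract_first('').strip()
        item['url'] = product.xpath('.//h2/a/@href').extract_first()
        item['price'] = product.xpath(
            ".//span[contains(@class,'product-price')]/text()").extract_first('').strip()
        # Only keep products where all fields were found.
        if item['title'] and item['price'] and item['url']:
            yield item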

Empty CSV after Scrapy Crawl

I keep getting an empty CSV file after running my code. I suspect it might be the XPaths, but I really don't know what I'm doing. There aren't any errors reported in the terminal output. I'm trying to get info from various Craigslist pages.
from scrapy.spiders import Spider
from scrapy.selector import Selector

from craigslist_probe.items import CraigslistSampleItem


class MySpider(Spider):
    name = "why"
    allowed_domains = ["craigslist.org"]

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        titles = response.selector.xpath("/section[@id='pagecontainer']")
        items = []
        for title in titles:
            item = CraigslistSampleItem()
            item["img"] = title.xpath("./div[@class='tray']").extract()
            item["body"] = title.xpath("./section[@id='postingbody']/text()").extract()
            item["itemID"] = title.xpath(".//div[@class='postinginfos']/p[@class='postinginfo']").extract()
            items.append(item)
        return items
I suspect your XPath doesn't correspond to the HTML structure of the page. Note that a single slash (/) implies a direct child, so, for example, /section would only work if the root element of the page were a <section> element, which will hardly ever be the case. Try using // throughout:
def parse(self, response):
    titles = response.selector.xpath("//section[@id='pagecontainer']")
    items = []
    for title in titles:
        item = CraigslistSampleItem()
        item["img"] = title.xpath(".//div[@class='tray']").extract()
        item["body"] = title.xpath(".//section[@id='postingbody']/text()").extract()
        item["itemID"] = title.xpath(".//div[@class='postinginfos']/p[@class='postinginfo']").extract()
        items.append(item)
    return items
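If the items are now being scraped but the CSV is still empty, also check that the crawl is actually run with a feed export; for example (the spider name comes from the question, the output filename is just an example):
scrapy crawl why -o items.csv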

Using Rails Update to Append to a Text Column in Postgresql

Thanks in advance for any help on this one.
I have a model in Rails that includes a PostgreSQL text column.
I want to append data to the existing column (i.e. mycolumn = mycolumn || newdata). The SQL I want to generate would look like:
update MyObjs set mycolumn = mycolumn || newdata where id = 12;
I would rather not select the data, update the attribute, and then write the new data back to the database. The text column could grow relatively large and I'd rather not read that data if I don't need to.
I DO NOT want to do this:
@myinstvar = MyObjs.select(:mycolumn).find(12)
newdata = @myinstvar.mycolumn.to_s + newdata
@myinstvar.update_attribute(:mycolumn, newdata)
Do I need to do a raw SQL transaction to accomplish this?
I think you could solve this problem by writing your query directly with the arel gem, which already ships with Rails.
Given that you have these values:
column_id = 12
newdata = "a custom string"
you can update the table this way:
# Initialize the Table and UpdateManager objects
table = MyObjs.arel_table
update_manager = Arel::UpdateManager.new Arel::Table.engine
update_manager.table(table)

# Compose the concat() function
concat = Arel::Nodes::NamedFunction.new 'concat', [table[:mycolumn], newdata]
concat_sql = Arel::Nodes::SqlLiteral.new concat.to_sql

# Set up the update manager
update_manager.set(
  [[table[:mycolumn], concat_sql]]
).where(
  table[:id].eq(column_id)
)

# Execute the update
ActiveRecord::Base.connection.execute update_manager.to_sql
This will generate an SQL string like this one:
UPDATE "MyObjs" SET "mycolumn" = concat("MyObjs"."mycolumn", 'a custom string') WHERE "MyObjs"."id" = 12

Entity Framework SQL Selecting 600+ Columns

I have a query generated by Entity Framework running against Oracle that's too slow: it takes about 4 seconds to run.
This is the main portion of my query:
var query = from x in db.BUILDINGs
            join pro_co in db.PROFILE_COMMUNITY on x.COMMUNITY_ID equals pro_co.COMMUNITY_ID
            join co in db.COMMUNITies on x.COMMUNITY_ID equals co.COMMUNITY_ID
            join st in db.STATE_PROFILE on co.STATE_CD equals st.STATE_CD
            where pro_co.PROFILE_NM == authorizedUser.ProfileName
            select new
            {
                COMMUNITY_ID = x.COMMUNITY_ID,
                COUNTY_ID = x.COUNTY_ID,
                REALTOR_GROUP_NM = x.REALTOR_GROUP_NM,
                BUILDING_NAME_TX = x.BUILDING_NAME_TX,
                ACTIVE_FL = x.ACTIVE_FL,
                CONSTR_SQFT_AVAIL_NB = x.CONSTR_SQFT_AVAIL_NB,
                TRANS_RAIL_FL = x.TRANS_RAIL_FL,
                LAST_UPDATED_DT = x.LAST_UPDATED_DT,
                CREATED_DATE = x.CREATED_DATE,
                BUILDING_ADDRESS_TX = x.BUILDING_ADDRESS_TX,
                BUILDING_ID = x.BUILDING_ID,
                COMMUNITY_NM = co.COMMUNITY_NM,
                IMAGECOUNT = x.BUILDING_IMAGE2.Count(),
                StateCode = st.STATE_NM,
                BuildingTypeItems = x.BUILDING_TYPE_ITEM,
                BuildingZoningItems = x.BUILDING_ZONING_ITEM,
                BuildingSpecFeatures = x.BUILDING_SPEC_FEATURE_ITEM,
                buildingHide = x.BUILDING_HIDE,
                buildinghideSort = x.BUILDING_HIDE.Count(y => y.PROFILE_NM == ProfileName) > 0 ? 1 : 0,
                BUILDING_CITY_TX = x.BUILDING_CITY_TX,
                BUILDING_ZIP_TX = x.BUILDING_ZIP_TX,
                LPF_GENERAL_DS = x.LPF_GENERAL_DS,
                CONSTR_SQFT_TOTAL_NB = x.CONSTR_SQFT_TOTAL_NB,
                CONSTR_STORIES_NB = x.CONSTR_STORIES_NB,
                CONSTR_CEILING_CENTER_NB = x.CONSTR_CEILING_CENTER_NB,
                CONSTR_CEILING_EAVES_NB = x.CONSTR_CEILING_EAVES_NB,
                DESCR_EXPANDABLE_FL = x.DESCR_EXPANDABLE_FL,
                CONSTR_MATERIAL_TYPE_TX = x.CONSTR_MATERIAL_TYPE_TX,
                SITE_ACRES_SALE_NB = x.SITE_ACRES_SALE_NB,
                DESCR_PREVIOUS_USE_TX = x.DESCR_PREVIOUS_USE_TX,
                CONSTR_YEAR_BUILT_TX = x.CONSTR_YEAR_BUILT_TX,
                DESCR_SUBDIVIDE_FL = x.DESCR_SUBDIVIDE_FL,
                LOCATION_CITY_LIMITS_FL = x.LOCATION_CITY_LIMITS_FL,
                TRANS_INTERSTATE_NEAREST_TX = x.TRANS_INTERSTATE_NEAREST_TX,
                TRANS_INTERSTATE_MILES_NB = x.TRANS_INTERSTATE_MILES_NB,
                TRANS_HIGHWAY_NAME_TX = x.TRANS_HIGHWAY_NAME_TX,
                TRANS_HIGHWAY_MILES_NB = x.TRANS_HIGHWAY_MILES_NB,
                TRANS_AIRPORT_COM_NAME_TX = x.TRANS_AIRPORT_COM_NAME_TX,
                TRANS_AIRPORT_COM_MILES_NB = x.TRANS_AIRPORT_COM_MILES_NB,
                UTIL_ELEC_SUPPLIER_TX = x.UTIL_ELEC_SUPPLIER_TX,
                UTIL_GAS_SUPPLIER_TX = x.UTIL_GAS_SUPPLIER_TX,
                UTIL_WATER_SUPPLIER_TX = x.UTIL_WATER_SUPPLIER_TX,
                UTIL_SEWER_SUPPLIER_TX = x.UTIL_SEWER_SUPPLIER_TX,
                UTIL_PHONE_SVC_PVD_TX = x.UTIL_PHONE_SVC_PVD_TX,
                CONTACT_ORGANIZATION_TX = x.CONTACT_ORGANIZATION_TX,
                CONTACT_PHONE_TX = x.CONTACT_PHONE_TX,
                CONTACT_EMAIL_TX = x.CONTACT_EMAIL_TX,
                TERMS_SALE_PRICE_TX = x.TERMS_SALE_PRICE_TX,
                TERMS_LEASE_SQFT_NB = x.TERMS_LEASE_SQFT_NB
            };
There is a section of code that tacks dynamic where and sort clauses onto the query, but I've left it out. The query takes about 4 seconds to run no matter what is in the where and sort.
I dropped the generated SQL into Oracle and the explain plan didn't appear to show anything that screamed "fix me". The cost is 1554.
If this isn't allowed I apologize, but I can't seem to find a good way to share this information. I've uploaded the explain plan generated by SQL Developer here: http://www.123server.org/files/explainPlanzip-e1d291efcd.html
Table Layout
Building
--------------------
- BuildingID
- CommunityId
- Lots of other columns
Profile_Community
-----------------------
- CommunityId
- ProfileNM
- lots of other columns
state_profile
---------------------
- StateCD
- ProfileNm
- lots of other columns
Profile
---------------------
- Profile-NM
- a few other columns
All of the tables with a lot of columns have 120-150 columns each. It seems like Entity Framework is generating a select statement that pulls every column from every table instead of just the ones I want.
The thing that's bugging me, and I think might be my issue, is that in my LINQ I've selected 50 items, but the generated SQL returns 677 columns. I suspect returning so many columns is the source of my slowness.
Any ideas why I am getting so many columns back, or how to speed up my query?
I have a suspicion some of the performance hit comes from your object creation. Try running the query with just a basic select x and see whether it's the SQL query or the object creation that takes the time.
Also, if the query being generated is too complicated, you could try separating it out into smaller sub-queries which gradually enrich your object, rather than trying to query everything at once.
I ended up creating a view that selects only the columns I want, and joining on the things that needed to be left-joined in LINQ.
It's pretty annoying that EF selects every column from every table you're trying to join across, but I guess I only noticed this because I am joining a bunch of tables with 150+ columns in them.

How to iterate through table using selenium?

I have a table called UserManagement that contains information about users. This table gets updated whenever a new user is created. If I create two users, I need to check whether the two users were actually created. The table contains ID, UserName, FirstName, LastName, Bdate, etc. Here the ID is generated automatically.
I am running a Selenium-TestNG script. Using Selenium, how can I get the UserName of the two users I have created? Do I have to iterate through the table? If so, how do I iterate through it?
Use ISelenium.GetTable(string) to get the contents of the table cells you want. For example,
selenium.GetTable("UserManagement.0.1");
will return the contents of the table's first row and second column. You could then assert that the correct username or usernames appear in the table.
Get the count of rows with selenium.getXpathCount() into a variable rowcount, put the column count in another variable, and then compare each cell against the new user's name. A cleaned-up version (the table ids fjsfj and dldl are placeholders from the original answer):
int rowcount = selenium.getXpathCount("//table[@id='fjsfj']//tr").intValue();
int colcount = 5;
// The new user to look for
String user1 = "ABC";
for (int i = 0; i < rowcount; i++) {
    for (int j = 0; j < colcount; j++) {
        // getTable takes a "tableLocator.row.column" cell address
        if (user1.equals(selenium.getTable("dldl." + i + "." + j))) {
            System.out.println(user1 + " inserted");
            break;
        }
    }
}
Get the number of rows using:
int noOfRowsInTable = selenium.getXpathCount("//table[@id='TableId']//tr").intValue();
If the UserName you want is at a fixed position, let's say the 2nd column, then for each row i read it as given below:
selenium.getText("xpath=//table[@id='TableId']//tr[" + i + "]//td[2]");
Note: we can find the number of columns in the table the same way, by counting the cells in one row:
int noOfColumnsInTable = selenium.getXpathCount("//table[@id='TableId']//tr[1]//td").intValue();
Generically, something like this?
table = @browser.table(:id, 'tableID')
table.rows.each do |row|
  # perform row operations here
  row.cells.each do |cell|
    # do cell operations here
  end
end
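The answers above use the old Selenium RC API; as a point of comparison, here is a minimal sketch of the same row iteration with the modern Selenium WebDriver Python bindings (the table id UserManagement, the page URL, the expected username, and the UserName column position are assumptions based on the question):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://example.com/users")  # placeholder URL

# Collect the UserName column (assumed to be the 2nd cell) from every data row.
usernames = []
for row in driver.find_elements(By.XPATH, "//table[@id='UserManagement']//tr"):
    cells = row.find_elements(By.TAG_NAME, "td")
    if len(cells) > 1:  # skips header rows, which use <th> cells
        usernames.append(cells[1].text)

assert "expected_user" in usernames  # placeholder username
driver.quit()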
