Scrapy: Fetch HTML class as string and/or int?


For educational purposes I've been building a spider to fetch data from an HTML-based marketplace. I managed to fetch all the text-based data I need; however, I also need the itemID, the data-mins-elapsed, the item quality and the item trait. These are not text nodes but values stored in HTML attributes and classes.
The ItemID is a unique ID for each item. In the HTML code of the website it can be found under: (I need the number "319842588", this number is unique to each item)
<tr class="cursor-pointer" data-on-click-link="/pc/Trade/Detail/319842588" data-on-click-link-action="NewWindow" data-toggle="tooltip" data-original-title="" title="">
The data-mins-elapsed keeps track of when the item was posted. This number changes every time you refresh the webpage as time goes by. It can be found under: (I need the number "3", this number will change constantly)
<td class="bold hidden-xs" data-mins-elapsed="3">Now</td>
The itemquality is the quality of a certain item. In the HTML code of the website it can be found under: (I need the "superior", the quality is unique to each item)
<img class="trade-item-icon item-quality-superior"
alt="Icon"
src="/Content/icons/bow.png"
data-trait="Infused"
/>
The itemtrait is the trait of a certain item. In the HTML code of the website it can be found under: (I need the "Infused", the trait is unique to each item)
<img class="trade-item-icon item-quality-superior"
alt="Icon"
src="/Content/icons/bow.png"
data-trait="Infused"
/>
How do I build an XPath expression, or something similar, to fetch these values?
Website link: https://eu.tamrieltradecentre.com/pc/Trade/SearchResult?SearchType=Sell&ItemID=10052&ItemNamePattern=Briarheart+Bow&IsChampionPoint=true&LevelMin=160&LevelMax=&ItemCategory1ID=&ItemCategory2ID=&ItemCategory3ID=&ItemQualityID=&ItemTraitID=3&PriceMin=&PriceMax=25000
Part of the relevant HTML code
This includes the HTML for each product; each product is listed in a TR with the class "cursor-pointer"
<table class="trade-list-table max-width">
<thead>
...
</thead>
<tr class="cursor-pointer" data-on-click-link="/pc/Trade/Detail/319836098"
data-on-click-link-action="NewWindow" data-toggle="tooltip">
<td>
<img class="trade-item-icon item-quality-superior" alt="Icon"
src="/Content/icons/bow.png" data-trait="Infused"/>
<div class="item-quality-superior">
Briarheart Bow
</div>
<div>
Level:
<img class="small-icon" src="/Content/icons/championPoint.png" />
160
</div>
</td>
<td class="hidden-xs">
...
</td>
<td class="hidden-xs">
...
</td>
<td class="gold-amount bold">
...
</td>
<td class="bold hidden-xs" data-mins-elapsed="15"></td>
</tr>
Spider file
# -*- coding: utf-8 -*-
import scrapy
import os
import csv
class TTCSpider(scrapy.Spider):
    name = "ttc_spider"
    allowed_domains = ["eu.tamrieltradecentre.com"]
    start_urls = ['https://eu.tamrieltradecentre.com/pc/Trade/SearchResult?ItemID=10052&SearchType=Sell&ItemNamePattern=Briarheart+Bow&ItemCategory1ID=&ItemCategory2ID=&ItemCategory3ID=&ItemTraitID=3&ItemQualityID=&IsChampionPoint=true&IsChampionPoint=false&LevelMin=160&LevelMax=&MasterWritVoucherMin=&MasterWritVoucherMax=&AmountMin=&AmountMax=&PriceMin=&PriceMax=25000']
    def start_requests(self):
        """Read keywords from keywords file and construct the search URL"""
        with open(os.path.join(os.path.dirname(__file__), "../resources/keywords.csv")) as search_keywords:
            for keyword in csv.DictReader(search_keywords):
                search_text = keyword["keyword"]
                url = "https://eu.tamrieltradecentre.com/pc/Trade/{0}".format(search_text)
                # The meta is used to send our search text into the parser as metadata
                yield scrapy.Request(url, callback=self.parse, meta={"search_text": search_text})
    def parse(self, response):
        containers = response.css('.cursor-pointer')
        for container in containers:
            # Defining the XPaths
            XPATH_ITEM_NAME = ".//td[1]//div[1]//text()"
            XPATH_ITEM_LEVEL = ".//td[1]//div[2]//text()"
            XPATH_ITEM_LOCATION = ".//td[3]//div[1]//text()"
            XPATH_ITEM_TRADER = ".//td[3]//div[2]//text()"
            XPATH_ITEM_PRICE = ".//td[4]//text()[2]"
            XPATH_ITEM_QUANTITY = ".//td[4]//text()[4]"
            XPATH_ITEM_LASTSEEN = "Help me plis :3"
            XPATH_ITEM_ITEMID = "Help me plis :3"
            XPATH_ITEM_QUALITY = "Help me plis :3"
            XPATH_ITEM_TRAIT = "Help me plis :3"
            # Extracting from list
            raw_item_name = container.xpath(XPATH_ITEM_NAME).extract()
            raw_item_level = container.xpath(XPATH_ITEM_LEVEL).extract()
            raw_item_location = container.xpath(XPATH_ITEM_LOCATION).extract()
            raw_item_trader = container.xpath(XPATH_ITEM_TRADER).extract()
            raw_item_price = container.xpath(XPATH_ITEM_PRICE).extract()
            raw_item_quantity = container.xpath(XPATH_ITEM_QUANTITY).extract()
            raw_item_lastseen = container.xpath(XPATH_ITEM_LASTSEEN).extract()
            raw_item_itemid = container.xpath(XPATH_ITEM_ITEMID).extract()
            raw_item_quality = container.xpath(XPATH_ITEM_QUALITY).extract()
            raw_item_trait = container.xpath(XPATH_ITEM_TRAIT).extract()
            # Cleaning the data
            item_name = ''.join(raw_item_name).strip() if raw_item_name else None
            item_level = ''.join(raw_item_level).replace('Level:', '').strip() if raw_item_level else None
            item_location = ''.join(raw_item_location).strip() if raw_item_location else None
            item_trader = ''.join(raw_item_trader).strip() if raw_item_trader else None
            item_price = ''.join(raw_item_price).strip() if raw_item_price else None
            item_quantity = ''.join(raw_item_quantity).strip() if raw_item_quantity else None
            item_lastseen = ''.join(raw_item_lastseen).strip() if raw_item_lastseen else None
            item_itemid = ''.join(raw_item_itemid).strip() if raw_item_itemid else None
            item_quality = ''.join(raw_item_quality).strip() if raw_item_quality else None
            item_trait = ''.join(raw_item_trait).strip() if raw_item_trait else None
            yield {
                'item_name': item_name,
                'item_level': item_level,
                'item_location': item_location,
                'item_trader': item_trader,
                'item_price': item_price,
                'item_quantity': item_quantity,
                'item_lastseen': item_lastseen,
                'item_itemid': item_itemid,
                'item_quality': item_quality,
                'item_trait': item_trait,
            }

You can use the built-in .re_first() to match a regular expression for the ItemID:
ItemID = container.xpath('./@data-on-click-link').re_first(r'(\d+)$') # same code for ItemQuality
ItemTrait = container.xpath('.//img[@data-trait]/@data-trait').get()
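To see the idea without running a spider, here is a minimal sketch that pulls the same four values from one row using only the Python standard library. It is a stand-in for Scrapy selectors, not Scrapy itself: `xml.etree.ElementTree` plays the role of `container.xpath(...)` and `re.search` the role of `.re_first(...)`. The `ROW` markup is a trimmed, well-formed copy of the HTML in the question, and the function name `extract_fields` is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of one result row from the question.
ROW = """
<tr class="cursor-pointer" data-on-click-link="/pc/Trade/Detail/319836098">
  <td>
    <img class="trade-item-icon item-quality-superior" alt="Icon"
         src="/Content/icons/bow.png" data-trait="Infused"/>
    <div class="item-quality-superior">Briarheart Bow</div>
  </td>
  <td class="bold hidden-xs" data-mins-elapsed="15"></td>
</tr>
"""


def extract_fields(row_markup):
    """Read attribute values from a row; missing pieces come back as None."""
    row = ET.fromstring(row_markup)

    # Item ID: the trailing digits of the link attribute on the <tr> itself.
    id_match = re.search(r"(\d+)$", row.get("data-on-click-link", ""))

    # Quality and trait both live on the <img> that carries data-trait.
    img = row.find(".//img[@data-trait]")
    img_class = img.get("class", "") if img is not None else ""
    quality_match = re.search(r"item-quality-(\w+)", img_class)

    # Minutes elapsed: select the cell by its attribute, not by position.
    cell = row.find(".//td[@data-mins-elapsed]")

    return {
        "item_id": int(id_match.group(1)) if id_match else None,
        "quality": quality_match.group(1) if quality_match else None,
        "trait": img.get("data-trait") if img is not None else None,
        "mins_elapsed": int(cell.get("data-mins-elapsed")) if cell is not None else None,
    }


fields = extract_fields(ROW)
print(fields)
```

The numeric fields are converted with `int()` here; in a spider you would do the same conversion on the strings the selectors return.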

First of all, you shouldn't need to ask such questions; a simple Google search should suffice. Nonetheless, all you need is a way to access the data stored in the attributes of an HTML node. The way to do that is to use @ as a prefix to the attribute name, e.g. to access the class attribute you would use div/@class.
For your problem I can suggest an XPath for one of your items; you should be able to take it from there.
XPATH_ITEM_LASTSEEN = ".//td[@data-mins-elapsed]/@data-mins-elapsed"
Also, to get 319842588 out of data-on-click-link="/pc/Trade/Detail/319842588", you can use an XPath similar to the one above together with Python's built-in string methods such as replace() or split() to get the desired data. For example:
Suppose you have:
x = "/pc/Trade/Detail/319842588"
# you could do either of the following
item_id = x.replace('/pc/Trade/Detail/', '')
item_id = x.split('/')[-1]
Hope that helps.
Cheers!!
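Since the question asks for the value "as string and/or int", the split approach can be wrapped in a small helper that also validates the result. This is only a sketch; the name `detail_id` is hypothetical and not part of Scrapy.

```python
def detail_id(link):
    """Return the trailing numeric ID of a detail link as an int, or None."""
    # Take the text after the last slash, ignoring any trailing slash.
    last_segment = link.rstrip("/").rsplit("/", 1)[-1]
    return int(last_segment) if last_segment.isdigit() else None


print(detail_id("/pc/Trade/Detail/319842588"))
```

Returning None instead of raising keeps a single malformed row from stopping the whole crawl.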

Related

Watir - scraping a grid of items

I'm trying to scrape the app URLs from a directory that's laid out in a grid:
<div id="mas-apps-list-tile-grid" class="mas-app-list">
<div class="solution-tile-container">
<div class="solution-tile-content-container">
<a href="url.com/app/345">
<div class="solution-tile-container">
<div class="solution-tile-content-container">
<a href="url.com/app/567">
... and so on
Here are my 2 lines of Watir code that are supposed to create an array with all URLs from a page:
company_listings = browser.div(id: 'mas-apps-list-tile-grid')
companies = company_listings.map { |div| div.a.href }
But instead of an array with URLs, 'companies' returns:
#<Watir::Map: located: false; {:id=>"mas-apps-list-tile-grid", :tag_name=>"div"} --> {:tag_name=>"map"}>
What am I doing wrong?

The #map method for a Watir::Element (or specifically Watir::Div in this case) returns a Watir::Map element. This is used for locating <map> tags/elements on the page.
In contrast, the #map method for a Watir::ElementCollection will iterate over each of the matching elements. This is what is missing.
You have a couple of options. If you want all the links in the grid, the most straightforward approach is to create a #links or #as element collection:
company_grid = browser.div(id: 'mas-apps-list-tile-grid')
company_hrefs = company_grid.links.map { |a| a.href }
If there are only some links you care about, you'll need to use the link's parents to narrow it down. For example, maybe it's just links located in a "solution-tile-content-container" div:
company_grid = browser.div(id: 'mas-apps-list-tile-grid')
company_listings = company_grid.divs(class: 'solution-tile-content-container')
company_hrefs = company_listings.map { |div| div.a.href }

Scraping the href value of anchor in Ruby

I'm working on a project where I have to scrape a "website," which is just an HTML file in one of the local folders. I've been trying to scrape down to the href value (a URL) of the anchor tag for each student object. I am also scraping for other things, so ignore the rest. Here is what I have so far:
def self.scrape_index_page(index_url) # responsible for scraping the index page that lists all of the students
  # return an array of hashes in which each hash represents one student.
  html = index_url
  doc = Nokogiri::HTML(open(html))
  # doc.css(".student-name").first.text
  # doc.css(".student-location").first.text
  # student_card = doc.css(".student-card").first
  # student_card.css("a").text
end
Here is one of the student profiles. They are all the same, so I'm just interested in scraping the href url value.
<div class="student-card" id="eric-chu-card">
<a href="students/eric-chu.html">
<div class="view-profile-div">
<h3 class="view-profile-text">View Profile</h3>
</div>
<div class="card-text-container">
<h4 class="student-name">Eric Chu</h4>
<p class="student-location">Glenelg, MD</p>
</div>
</a>
</div>
thanks for your help!

Once you get an anchor tag in Nokogiri, you can get the href like this:
anchor["href"]
So in your example, you could get the href by doing the following:
student_card = doc.css(".student-card").first
href = student_card.css("a").first["href"]
If you wanted to collect all of the href values at once, you could do something like this:
hrefs = doc.css(".student-card a").map { |anchor| anchor["href"] }

How to use resources files .resx to get translated column header text in mvcgrid.net

How to use resources files .resx to get translated column header text in mvcgrid.net ?

There is a localisation example: http://mvcgrid.net/demo/localization
But we did this through the _Grid.cshtml view which is configured like this:
GridDefaults gridDefaults = new GridDefaults()
{
RenderingMode = RenderingMode.Controller,
ViewPath = "~/Views/MVCGrid/_Grid.cshtml",
NoResultsMessage = "Sorry, no results were found"
};
and in the _Grid.cshtml on looping through the columns:
<tr>
    @foreach (var col in Model.Columns)
    {
        var thStyleAttr = !String.IsNullOrWhiteSpace(ColumnStyle(col)) ? String.Format(" style='{0}'", ColumnStyle(col)) : "";
        <th onclick='@Html.Raw(ColumnOnClick(col))' @(Html.Raw(thStyleAttr))>@DbRes.T(col.HeaderText, "Grids") @(SortImage(col))</th>
    }
</tr>
Note that we are not using resources here but we are using this lib: https://github.com/RickStrahl/Westwind.Globalization but I think it should be the same idea.

strict with CGI::AJAX

I have some code for updating a password in a table. I'm using the CGI::Ajax module to update the password and show a popup on the corresponding execution. When I use this code in my application it runs, but I don't get the expected output: the Perl subroutine is not called when the JavaScript function is invoked, and the password is not written to the table. I don't get any error either.
#!/usr/bin/perl -w
use strict;
use CGI;
use DBI;
use Data::Dumper;
my $p = new CGI qw(header start_html end_html h1 script link);
use Class::Accessor;
use CGI::Ajax;
my $create_newuser;
my $ajax = new CGI::Ajax('fetch_javaScript' => $create_newuser);
print $ajax->build_html($p,\&Show_html,{-charset=>'UTF-8', -expires=>'-1d'});
sub Show_html
{
my $html = <<EOHTML;
<html>
<body bgcolor="#D2B9D3">
<IMG src="karvy.jpg" ALT="image">
<form name='myForm'>
<center><table><tr><td>
<div style="width:400px;height:250px;border:3px solid black;">
<center><h4>Create New Password's</h4>
<p>&nbsp User Name</b>&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp<INPUT TYPE="text" NAME="user" id = "user" size = "15" maxlength = "15" tabindex = "1"/></p>
<p>&nbsp Password:</b>&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp<INPUT TYPE=PASSWORD NAME="newpassword" id = "newpassword" size = "15" maxlength = "15" tabindex = "1"/></p>
<p>&nbsp Re-Password:</b>&nbsp&nbsp&nbsp<INPUT TYPE=PASSWORD NAME="repassword" id = "repassword" size = "15" maxlength = "15" tabindex = "1"/></p>
<input type="submit" id="val" value="Submit" align="middle" method="GET" onclick="fetch_javaScript(['user','newpassword','repassword']);"/><INPUT TYPE="reset" name = "Reset" value = "Reset"/>
<p>Main Menu <A HREF = login.pl>click here</A>
</center>
</div>
</td></tr></table></center>
</form>
</body>
</html>
EOHTML
return $html;
}
$create_newuser =sub
{
my @input = $p->params('args');
my $user=$input[0];
my $password=$input[1];
my $repassword=$input[2];
my $DSN = q/dbi:ODBC:SQLSERVER/;
my $uid = q/123/;
my $pwd = q/123/;
my $DRIVER = "Freetds";
my $dbh = DBI->connect($DSN,$uid,$pwd) or die "Coudn't Connect SQL";
if ($user ne '')
{
if($password eq $repassword)
{
my $sth=$dbh->do("insert into rpt_account_information (user_id,username,password,user_status,is_admin) values(2,'".$user."','".$password."',1,1)");
my $value=$sth;
print $value,"\n";
if($value == 1)
{
print 'Your pass has benn changed.Return to the main page';
}
}
else
{
print "<script>alert('Password and Re-Password does not match')</script>";
}
}
else
{
print "<script>alert('Please Enter the User Name')</script>";
}
}

my $create_newuser;
my $ajax = new CGI::Ajax('fetch_javaScript' => $create_newuser);
...;
$create_newuser =sub { ... };
At the moment when you create a new CGI::Ajax object, the $create_newuser variable is still undef. Only much later do you assign a coderef to it.
You can either assign the $create_newuser before you create the CGI::Ajax:
my $create_newuser =sub { ... };
my $ajax = new CGI::Ajax('fetch_javaScript' => $create_newuser);
...;
Or you use a normal, named subroutine and pass a coderef.
my $ajax = new CGI::Ajax('fetch_javaScript' => \&create_newuser);
...;
sub create_newuser { ... }
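The ordering problem is not specific to Perl. As a rough analogy in Python (the `register` function and `registry` dict are hypothetical stand-ins for the CGI::Ajax constructor, not its real API), passing a variable hands over its current value, so assigning to the variable afterwards changes nothing that was already registered:

```python
registry = {}


def register(name, handler):
    """Store whatever value is passed in right now."""
    registry[name] = handler


# Broken order: the variable is still empty when it is registered.
create_newuser = None
register("fetch_javaScript", create_newuser)
create_newuser = lambda *args: "created"  # too late, the registry holds None
early = registry["fetch_javaScript"]


# Fixed order: define the function first, then register it.
def create_newuser_fixed(*args):
    return "created"


register("fetch_javaScript", create_newuser_fixed)
late = registry["fetch_javaScript"]

print(early, late())
```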
Aside from this main error, your script has many more problems.
You should use warnings instead of the -w option.
For debugging purposes only, use CGI::Carp 'fatalsToBrowser' (sometimes together with warningsToBrowser) can be extremely helpful. Otherwise, keeping a close eye on the error logs is a must.
my $p = new CGI qw(header start_html end_html h1 script link) doesn't make any sense. my $p = CGI->new should be sufficient.
use Class::Accessor seems a bit random here.
The HTML in Show_html is careless. First, your heredoc allows variable interpolation and escape codes – it has the semantics of a double-quoted string. Most of the time, you don't want that. Start a heredoc like <<'END_OF_HTML' to avoid interpolation etc.
Secondly, look at that tag soup you are producing! Here are some snippets that astonish me:
bgcolor="#D2B9D3", align="middle" – because CSS hasn't been invented yet.
<center> – because CSS hasn't been invented yet, and this element isn't deprecated at all.
<table><tr><td><div ... </div></td></tr></table> – because there is nothing wrong with a table containing a single cell. (For what? This isn't even for layout reasons!) This table cell contains a single div …
… which contains another center. Seriously, what is so great about unnecessary DOM elements that CSS isn't even an option?
style="width:400px;height:250px;border:3px solid black;" – because responsive design hasn't been invented yet.
<p> ... </b> – Oh, what delicious tag soup!
&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp – this isn't a typewriter, you know. Use CSS and proper markup for your layout. There is a difference between text containing whitespace, and empty areas in your layout.
tabindex = "1" … tabindex = "1" … tabindex = "1" – I don't think you know what tabindex does.
<A HREF = login.pl> – LOWERCASING OR QUOTING YOUR ATTRIBUTES IS FOR THE WEAK!!1
onclick="fetch_javaScript(['user','newpassword','repassword']);" – have you read the CGI::Ajax docs? This is not how it works: You need to define another argument with the ID of the element where the answer HTML is displayed.
In your create_newuser, you have an SQL injection vulnerability. Use placeholders to solve that. Instead of $dbh->do("INSERT INTO ... VALUES('$foo')") use $dbh->do('INSERT INTO ... VALUES(?)', undef, $foo).
print ... – your Ajax handler shouldn't print output; instead it should return an HTML string, which then gets hooked into the DOM at the place your JS function specified. You want something like
use HTML::Entities;
sub create_newuser {
my ($user, $password, $repassword) = $p->params('args');
my ($e_user, $e_password) = map { encode_entities($_) } $user, $password;
# DON'T DO THIS, it is a joke
return "Hello <em>$e_user</em>, your password <code>$e_password</code> has been successfully transmitted in cleartext!";
}
and in your JS:
fetch_javaScript(['user','newpassword','repassword'], ['answer-element'], 'GET');
where your HTML document somewhere has a <div id="answer-element" />.
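The placeholder advice above applies to any database driver. As an illustration only, here is the same idea in Python with the standard-library sqlite3 module and an in-memory database. The table name is taken from the question, but the two-column schema and the `add_account` function are simplifications invented for this sketch, and a real application would store a password hash rather than the password itself.

```python
import sqlite3

# In-memory database with a simplified version of the question's table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rpt_account_information (username TEXT, password TEXT)"
)


def add_account(connection, username, password):
    """Insert one row; the ? placeholders keep values out of the SQL text."""
    connection.execute(
        "INSERT INTO rpt_account_information (username, password) VALUES (?, ?)",
        (username, password),
    )
    connection.commit()


# Input that would break a query built by string concatenation.
hostile_name = "x'); DROP TABLE rpt_account_information; --"
add_account(conn, hostile_name, "secret")

rows = conn.execute("SELECT username FROM rpt_account_information").fetchall()
print(rows)
```

The hostile string ends up stored as plain data, and the table is still there afterwards.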

Cannot display an image from filesystem in grails

I'm new to grails (1.3.7) and I've been put in a strange situation with displaying an image from the filesystem. The one.png picture is put into web-app/images/fotos directory. The zz.gsp:
<img src="${resource(dir:'images/fotos', file:'one.png')}" alt="Nothing" />
related to the void action def zz = {} works fine. But if I intend to display the same picture in rawRenderImage.gsp:
<body>
<p>
${fdir} <br/> ${fname} <!-- OK -->
</p>
<g:if test="${fname}">
<img src="${resource(dir:'fdir',file: 'fname')}" alt ="Nothing"/>
</g:if>
</body>
the picture doesn't appear, in spite of the parameters fdir and fname being passed to the page. The action in the controller is:
def rawRenderImage = {
// def basePath = servletContext.getRealPath("/")
// def basePath = grailsAttributes.getApplicationContext().getResource("/").getFile().toString() + "/images/fotos/"
// def basePath = grailsAttributes.getApplicationContext().getResource("/").getFile().toString()
// def fname = params.photoId + ".png"
// def fname = "/images/fotos/" + params.photoId + ".png"
basePath = "images/fotos" // or basePath = "images" for /images/one.png
fname = "one.png"
[fdir:basePath, fname:fname]
}
Even direct assignments basePath="images/fotos" and fname="one.png" don't work, nor does any combination with basePath to obtain the absolute path. Putting the picture directly in the images directory doesn't work either. I use NetBeans, but it also fails in console mode.
Help please.

When passing in your filename and directory as variables in the model, don't quote them in your tag's src attribute. Then the Groovy ${} expression evaluates the variables fdir and fname rather than the literal strings 'fdir' and 'fname'.
<g:if test="${fname}">
<img src="${resource(dir:fdir,file:fname)}" alt ="Something"/>
</g:if>
