I want to scrape daily value changes from a public page:
[1] http://www.example.com/page.html
I've got a full XPath:
[2] /html/body/div[5]/table[1]/tbody/tr[4]/td[2]/@data-val
or a command that works to get that value through the Chrome console:
[3] $x("string(/html/body/div[5]/table[1]/tbody/tr[4]/td[2]/@data-val)")
But I'm stuck on how to combine [1] + [2]/[3] so that I can retrieve that data-val using just an HTTP request. (I'm using Integromat to make the HTTP request, but failed to find any reasonable examples.)
You will have to make a GET request to load the document.
Afterwards you can use a library to extract the value by XPath.
Please provide more info on which language/framework you are using.
Here is an example in Python for reference:
import requests
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

# body must hold the raw HTML of the fetched page
body = requests.get('http://example.com').content
response = HtmlResponse(url='http://example.com', body=body)
Selector(response=response).xpath('//span/text()').get()
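For the specific page in the question, a minimal sketch using requests and lxml (assuming the page is plain server-rendered HTML and the value really sits in a data-val attribute) could look like this:

import requests
from lxml import html

page = requests.get('http://www.example.com/page.html')
tree = html.fromstring(page.content)
# Same expression as the Chrome console command; string() yields the attribute value
value = tree.xpath('string(/html/body/div[5]/table[1]/tbody/tr[4]/td[2]/@data-val)')
print(value)

Note that Chrome builds its XPath against the live DOM, so if the HTML actually served by the site contains no tbody elements, that step has to be dropped from the expression (the same pitfall comes up in the Seattle parcel question further down).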
I am working with JMeter and BlazeMeter on a login script for a web application made with Genexus.
The problem that I am having is with the POST.
Whenever I try to make an HTTP POST request, JMeter returns the following:
As you can see in the response body, I am getting a 440 HTTP error code. This is a login time-out, which means the client's session has expired and it must log in again. I used to get a 403 error code, but now, after making some adjustments, I get 440. Do you have any suggestions on how to resolve this?
First, I'm not an expert on Genexus. All my findings are from a black-box point of view.
Genexus Security
I found that Genexus requires at least two things to authenticate on a web application (I tested only Java and .NET generated apps).
The GXState parameter. This param is sent in the POST request and, from my understanding, works like the "Synchronizer token pattern" (see more info on Cross-site request forgery). We need to send this param on every POST request.
The gxajaxEvt parameter. This is very specific to Genexus apps. The documentation mentions this parameter is sent encrypted in the URL, and this behavior is managed by the "Javascript debug mode" property:
# Javascript Debug Mode: Yes
http://{server}:{port}/{webappname}/servlet/com.{kbname}.{objectname}?gxfullajaxEvt,gx-no-cache=1442811265833
# Javascript Debug Mode: No (default value)
http://{server}:{port}/{webappname}/servlet/com.{kbname}.{objectname}?64df96a2d9b8480aed416e470dae529e,gx-no-cache=1442811265833
JMeter Script
So, to get the GXState, we can use the Regular Expression Extractor:
Name of created variable: GXState
Regular expression: name="GXState" value='(.*?)'
Template: $1$
Match No.: 1
Default Value: NOT_FOUND
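To illustrate what that extractor captures, here is a small Python check against a made-up fragment of the page source (the GXState content below is purely illustrative, not a real key):

import re

# Made-up fragment of the login page, shaped like the hidden GXState input
html = """name="GXState" value='{"GX_AJAX_KEY":"d41d8cd98f00b204e9800998ecf8427e"}'"""
pattern = """name="GXState" value='(.*?)'"""
print(re.search(pattern, html).group(1))  # -> the GXState JSON string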
The GXState is a JSON object; from it we can extract the GX_AJAX_KEY used to encrypt the gxajaxEvt string. Note that I found GX_AJAX_KEY to be the encryption key in this case, but other keys could apply. We can check this using the browser web console, with this:
gx.sec.encrypt("gxajaxEvt")
We'll see something like this:
"8722e2ea52fd44f599d35d1534485d8e206d507a46070a816ca7fcdbe812b0ad"
As we can see, all the client-side encryption code is in the gxgral.js file. Genexus uses the Rijndael algorithm (of which AES is a subset) with a block size of 128 bits.
To emulate this client behavior in the JMeter script we can use the JSR223 Sampler. A way to get the Rijndael result is to use the Bouncy Castle library. We need to add this jar (bouncycastle:bcprov-jdk15to18:1.68) to JMeter's lib folder to use it.
Our script will be something like this (language: Groovy 3.0.5 / Groovy Scripting Engine 2.0):
import com.jayway.jsonpath.JsonPath
import java.nio.charset.StandardCharsets
import java.util.Arrays
import org.bouncycastle.crypto.BufferedBlockCipher
import org.bouncycastle.crypto.engines.RijndaelEngine
import org.bouncycastle.crypto.params.KeyParameter
import org.bouncycastle.util.encoders.Hex

// Read the GXState JSON captured by the Regular Expression Extractor
String gxState = vars.get('GXState')
String gxAjaxKey = JsonPath.read(gxState, '$.GX_AJAX_KEY')

// Zero-pad the plaintext to a single 16-byte block
byte[] input = Arrays.copyOf('gxajaxEvt'.getBytes(StandardCharsets.UTF_8), 16)

// Rijndael with a 128-bit block size, keyed with the hex-decoded GX_AJAX_KEY
RijndaelEngine engine = new RijndaelEngine(128)
KeyParameter key = new KeyParameter(Hex.decode(gxAjaxKey))
BufferedBlockCipher cipher = new BufferedBlockCipher(engine)
cipher.init(true, key)

byte[] out = new byte[16]
int length = cipher.processBytes(input, 0, 16, out, 0)
cipher.doFinal(out, length)
String encryptedOutput = Hex.toHexString(out)
log.info 'gx.sec.encrypt("gxajaxEvt")=' + encryptedOutput

// A fresh timestamp for the gx-no-cache URL parameter
String gxNoCache = String.valueOf(System.currentTimeMillis())
log.info 'gx-no-cache=' + gxNoCache

// Expose both values to the following HTTP Request sampler
vars.put('gxajaxEvt', encryptedOutput)
vars.put('gxNoCache', gxNoCache)
The script works like this:
First, we get the previously extracted GXState variable.
Second, using JSON Path (already bundled with JMeter 5.4.1) we extract the GX_AJAX_KEY property.
Third, we apply the Rijndael algorithm to the gxajaxEvt string using the GX_AJAX_KEY as the key (see the Python sketch below).
We also create the gx-no-cache value to bypass caching.
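For reference, here is a rough Python equivalent of the encryption step, handy for sanity-checking the value outside JMeter (this assumes pycryptodome is available and that GX_AJAX_KEY decodes to a 16-byte key; the key below is made up):

from Crypto.Cipher import AES  # pycryptodome

gx_ajax_key = "00112233445566778899aabbccddeeff"  # made-up GX_AJAX_KEY value
plaintext = "gxajaxEvt".encode("utf-8").ljust(16, b"\x00")  # zero-pad to one 16-byte block
# Rijndael with a 128-bit block and a 16-byte key is AES-128; a single raw block is effectively ECB
cipher = AES.new(bytes.fromhex(gx_ajax_key), AES.MODE_ECB)
print(cipher.encrypt(plaintext).hex())  # hex string to use as the encrypted gxajaxEvt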
With these variables we can send the next request successfully:
We can find this sample JMeter script available here.
For complex scripts, please refer to this guide (requires GXTest).
In case we get this exception in JMeter (java.util.zip.ZipException: Not in GZIP format), please refer to this answer too.
Any HTTP 4xx status code is a client error, i.e. you're sending an incorrect request.
If the custom 440 status code means "session has expired", my expectation is that you have a recorded, hard-coded session ID somewhere in your request parameters or headers.
You should carefully inspect the previous response(s) and look for something which appears to be a session ID. Once you find it, extract it using a suitable JMeter Post-Processor and replace the hard-coded session ID with the appropriate JMeter Variable. The process is known as correlation.
I am trying to scrape property data from "http://web6.seattle.gov/DPD/ParcelData/parcel.aspx?pin=9906000005".
I identified the element that I am interested in ("Base Zone" data in the table) and copied the XPath from the Chrome developer tools. When I run it through Scrapy I get an empty list.
I used the Scrapy shell to load the site and tried several response queries. The page loads and I can scrape the header, but nothing in the body of the page loads; it all comes up as empty lists.
My scrapy script is as follows:
import scrapy

class ZoneSpider(scrapy.Spider):
    name = 'zone'
    allowed_domains = ['web']
    start_urls = ['http://web6.seattle.gov/DPD/ParcelData/parcel.aspx?pin=9906000005']

    def parse(self, response):
        self.log("base_zone: %s" % response.xpath('//*[@id="ctl00_cph_p_i1_i0_vwZoning"]/tbody/tr/td/table/tbody/tr[1]/td[2]/span/text()').extract())
        self.log("use: %s" % response.xpath('//*[@id="ctl00_cph_p_i3_i0_vwKC"]/tbody/tr/td/table/tbody/tr[3]/td[2]/text()').extract())
You will see that the logs return an empty list. In the Scrapy shell, when I query the XPath for the header I get a valid response:
response.xpath('//*[#id="ctl00_headSection"]/title/text()').extract()
['\r\n\tSeattle Parcel Data\r\n']
But when I query anything in the body I get an empty list:
response.xpath('/body').extract()
[]
What I would like to see in my scrapy code is a response like the following:
base_zone: "SF 5000"
use: "Duplex"
If you remove tbody from your XPath it will work (a sketch with the corrected selectors follows the quoted docs below).
Since Developer Tools operate on a live browser DOM, what you'll actually see when inspecting the page source is not the original HTML, but a modified one after applying some browser clean up and executing Javascript code. Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.
Source: https://docs.scrapy.org/en/latest/topics/developer-tools.html#caveats-with-inspecting-the-live-browser-dom
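For example, the two selectors from the question with every tbody step removed (a sketch only; the element IDs are copied from the question and not re-verified against the live page):

def parse(self, response):
    # Same selectors as before, with all /tbody steps removed
    base_zone = response.xpath('//*[@id="ctl00_cph_p_i1_i0_vwZoning"]/tr/td/table/tr[1]/td[2]/span/text()').extract()
    use = response.xpath('//*[@id="ctl00_cph_p_i3_i0_vwKC"]/tr/td/table/tr[3]/td[2]/text()').extract()
    self.log("base_zone: %s" % base_zone)
    self.log("use: %s" % use)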
I'd like to iterate over the request parameters in the request handler.
I'm following the example from the documentation, but I can't get it to work.
By following the getting started guide and using the piece of code provided to range over parameters, I get:
actions/home.go:8:26: undefined: url
Is there a way to iterate over the request parameters using Buffalo's context?
You have to import the url package. The undefined: url compile error means the snippet references an identifier from net/url (for example url.Values) without that package being imported:
import "net/url"
I really need help from this community.
My question is that when I use the code
=========================================================================
response.xpath("//div[contains(#class,'check-prices-widget-not-sponsored')]/a/div[contains(#class,'check-prices-widget-not-sponsored-link')]").extract()
to extract the vendor name in the Scrapy shell, the output is empty. I really do not know why that happened, and it seems to me that the problem might be that the website info is updated dynamically.
The URL for this web scraping is: https://cruiseline.com/cruise/7-night-bahamas-florida-new-york-roundtrip-32860, and what I need is the vendor name and price for each vendor. The attached pic is a screenshot of the element inspector.
Really appreciate the help!
You need to always check the HTML source code in your browser (usually with Ctrl+U).
This way you'll find that the information you want is embedded inside Javascript variables as JSON:
var partnerPrices = [{"pool":"9a316391b6550eef969c8559c14a380f","partner":"ncl.com","priority":0,"currency":"USD","data":{"32860":{"2018-02-25":{"Inside":579,"Suite":1199,"Balcony":699,"Oceanview":629},....
var sponsored_partners = [{"code":"CDCNA","name":"cruises.com","value":"cruises.com","logo":"\/images\/partner-logo-cruises-sm.png","logo_sprite":"partner-logo-cruises-com"},...
So you need to import json, parse response.body (using re or another method), and then json.loads() the extracted JSON strings to iterate through the two arrays.
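A rough sketch of that approach (the variable names partnerPrices and sponsored_partners come from the page source shown above; the exact regular expressions may need adjusting to the live page):

import json
import re

def extract_vendors(html):
    # Pull the two Javascript arrays out of the raw page source
    prices = json.loads(re.search(r'var partnerPrices\s*=\s*(\[.*?\]);', html, re.S).group(1))
    sponsored = json.loads(re.search(r'var sponsored_partners\s*=\s*(\[.*?\]);', html, re.S).group(1))
    return prices, sponsored

# In the Scrapy shell or spider: prices, sponsored = extract_vendors(response.text)
# Each entry in prices carries "partner" (the vendor name) and "data" (per-date cabin prices).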
I've used Fiddler to capture these HTTP calls. Here's the problem:
I have HTTP POST data that looks like the one below:
Notice how it has many 'employeeIds' and also 'shiftSumIds'.
Now, these IDs come from a previous HTTP response that looks like the one below:
Is there an easy way to extract those IDs and prepare the POST data? Thanks in advance.
--Ishti
The short answer is the JSON Path Extractor, available via JMeter Plugins, which is designed for getting "interesting" values from JSON data. See the Using the XPath Extractor in JMeter guide (look for the "Parsing JSON" chapter) for installation instructions and some form of JSON Path language reference.
If that is not enough and you need some assistance in constructing the JSON Path query and building the HTTP request from it, please include a text version of the response and request using e.g. the http://paste.org service, as reading a large amount of text from a small screenshot isn't very handy and the chance of getting an answer is minimal.
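For illustration only (the real response structure is only visible in the screenshot, so these field names are guesses taken from the POST data in the question): if the previous response exposes the IDs under fields named employeeId and shiftSumId, the JSON Path expressions would be along the lines of
$..employeeId
$..shiftSumId
with each extractor writing its matches into a JMeter variable that the recorded POST body can then reference.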