I need to scrape a webpage (its URL is in the code below).
The page has a Cross Reference section that I want to scrape, but that section is missing when I collect the page content with Python requests using the code below:
import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
The resulting content does not contain the Cross Reference part, probably because it has not been loaded yet. I can scrape the rest of the HTML, but not that section. When I did the same thing with Selenium it worked fine, which suggests Selenium can find the element once it has loaded.
Can someone guide me on how to get this done with Python requests and BeautifulSoup instead of Selenium?
The data is loaded through JavaScript, but it is also embedded in the page as escaped JSON (in the element with id arrow-state), so you can extract it with requests, BeautifulSoup and the json module:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

# The page state is stored as JSON with custom escapes (&q; for ", and so on).
t = soup.select_one('#arrow-state').text
t = t.replace('&q;', '"').replace('&g;', ">").replace('&l;', "<").replace('&a;', "&")
data = json.loads(t)

# Find the component that wraps the product detail page.
d = None
for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
    if item['componentName'] == 'PdpWrapper':
        d = item
        break

if d:
    cross_reverence_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
    print(json.dumps(cross_reverence_product_tiles, indent=4))
Prints:
[
    {
        "partId": "16571604",
        "partNumber": "CGB3B1X5R1A475M055AC",
        "productDetailUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productDetailShareUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productImage": "https://static5.arrow.com/pdfs/2017/4/18/7/26/14/813/tdk_/manual/010101_lowprofile_pi0402.jpg",
        "manufacturerName": "TDK",
        "productLineTitle": "Capacitor Ceramic Multilayer",
        "productDescription": "Cap Ceramic 4.7uF 10V X5R 20% Pad SMD 0603 85\u00b0C T/R",
        "datasheetUrl": "",
        "lowestPrice": 0.0645,
        "lowestPriceFormatted": "$0.0645",
        "highestPrice": 0.3133,
        "highestPriceFormatted": "$0.3133",
        "stockFormatted": "1,875",
        "stock": 1875,
        "attributes": [],
        "buyingOptionType": "AddToCart",
        "numberOfAttributesToShow": 1,
        "rrClickTrackingUrl": null,
        "pricingDataPopulated": true,
        "sourcePartId": "V72:2272_06586404",
        "sourceCode": "ACNA",
        "packagingType": "Cut Strip",
        "unitOfMeasure": "",
        "isDiscontinued": false,
        "productTileHint": null,
        "tileSize": 1,
        "tileType": "1x1",
        "suplementaryClasses": "u-height"
    },
...and so on.
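The two steps in the answer above (undoing the custom escapes, then searching a list of components by name) can be tried offline. The following is only a minimal sketch: the helper names unescape_state and find_component and the miniature sample string are made up for illustration, and the real page nests the list much deeper, as the answer's code shows.

```python
import json


def unescape_state(raw):
    """Undo the custom escapes used in the embedded state string."""
    # '&a;' is restored last so that a freshly restored '&' cannot
    # combine with following text into another escape token.
    for token, char in (("&q;", '"'), ("&g;", ">"), ("&l;", "<"), ("&a;", "&")):
        raw = raw.replace(token, char)
    return raw


def find_component(components, name):
    """Return the first component dict whose componentName matches, else None."""
    for component in components:
        if component.get("componentName") == name:
            return component
    return None


# Hypothetical miniature of the escaped state, not the real page content.
sample = (
    "{&q;items&q;: [{&q;componentName&q;: &q;Header&q;}, "
    "{&q;componentName&q;: &q;PdpWrapper&q;, &q;note&q;: &q;a &a; b&q;}]}"
)
state = json.loads(unescape_state(sample))
wrapper = find_component(state["items"], "PdpWrapper")
print(wrapper["note"])
```

Returning None when nothing matches keeps the caller's "if d:" style check from the answer working unchanged.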
Selenium alone is enough to scrape the Cross References section: induce WebDriverWait for visibility_of_all_elements_located() with either of the following locator strategies:
Using CSS_SELECTOR:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))])
Using XPATH:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='WideSidebarProductList-list']//h4")))])
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
['CGB3B1X5R1A475M055AC', 'CL10A475MP8NNNC', 'GRM185R61A475ME11D', 'C0603C475M8PACTU']
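To make clear what an explicit wait does, here is a minimal sketch of the underlying idea in plain Python: call a condition repeatedly until it returns something truthy or a timeout expires. This is not Selenium's implementation; wait_until, WaitTimeout and fake_lookup are hypothetical names used only for this illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy within the timeout."""


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value, then return that value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Stand-in for an element lookup that only succeeds on the third poll,
# the way a JavaScript-rendered section appears some time after page load.
calls = {"n": 0}


def fake_lookup():
    calls["n"] += 1
    return ["CGB3B1X5R1A475M055AC"] if calls["n"] >= 3 else []


print(wait_until(fake_lookup, timeout=1.0, poll_interval=0.01))
```

A plain requests call has no such retry step, which is why the section that is rendered later never shows up in its response.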
Please help. I am getting an error when I run the following code:
$client = Client::createChromeClient(null, [
        '--headless',
        '--no-sandbox',
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
        '--window-size=1200,1100',
        '--disable-gpu',
    ],
    ["port" => 9080, 'request_timeout_in_ms' => 100000]
);
$client->request('GET', 'https://www.apple.com');
The error I am getting is:
unknown error: net::ERR_NAME_NOT_RESOLVED
  (Session info: headless chrome=107.0.5304.87)
#0 /var/www/html/tests/php/scraping/panther/vendor/php-webdriver/webdriver/lib/Remote/HttpCommandExecutor.php(385): Facebook\WebDriver\Exception\WebDriverException::throwException()
#1 /var/www/html/tests/php/scraping/panther/vendor/php-webdriver/webdriver/lib/Remote/RemoteWebDriver.php(598): Facebook\WebDriver\Remote\HttpCommandExecutor->execute()
#2 /var/www/html/tests/php/scraping/panther/vendor/php-webdriver/webdriver/lib/Remote/RemoteWebDriver.php(257): Facebook\WebDriver\Remote\RemoteWebDriver->execute()
#3 /var/www/html/tests/php/scraping/panther/vendor/symfony/panther/src/Client.php(532): Facebook\WebDriver\Remote\RemoteWebDriver->get()
#4 /var/www/html/tests/php/scraping/panther/vendor/symfony/panther/src/Client.php(276): Symfony\Component\Panther\Client->get()
#5 /var/www/html/tests/php/scraping/panther/index.php(26): Symfony\Component\Panther\Client->request()
#6 {main}
I'm sending hits to the GA Measurement Protocol, and some of them never reach GA. I've noticed that they all have one thing in common: the user agent is Firefox, varying only in version and device. Some examples:
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0
Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:103.0) Gecko/20100101 Firefox/103.0
Mozilla/5.0 (Android 10; Mobile; rv:103.0) Gecko/103.0 Firefox/103.0
The GA validator accepts those examples when I check them in debug mode like this:
https://www.google-analytics.com/debug/collect?v=1&tid=UA-XXXXXXXX-1&t=event&ec=Ecommerce&ea=purchase&pa=purchase&cid=1234567890.1234567890&ni=1&ti=184242&tr=1060&uip=X.X.X.X&ua=Mozilla%2F5.0+%28Windows+NT+10.0%3B+Win64%3B+x64%3B+rv%3A103.0%29+Gecko%2F20100101+Firefox%2F103.0&pr1id=test_1&pr1pr=530&pr1qt=1&pr1ps=1
I get this response:
{
"hitParsingResult": [ {
"valid": true,
"parserMessage": [ ],
"hit": "/debug/collect?v=1..."
} ],
"parserMessage": [ {
"messageType": "INFO",
"description": "Found 1 hit in the request."
} ]
}
BUT in production, GA responds to the same requests with a 400 Bad Request error and gives no details: "Your client has issued a malformed or illegal request. That’s all we know."
So what might be wrong with the Firefox UA?
UPD: I've managed to make this work by unsetting the 'User-Agent' header when it contains 'Firefox'; the corresponding 'ua' parameter in the payload is then accepted.
if (strpos($requestHeaders['User-Agent'], 'Firefox') !== false) {
    unset($requestHeaders['User-Agent']);
}
But it's still unclear what was wrong with such headers in the first place.
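For readers working in Python, the workaround above can be sketched as follows. It is only an illustration under assumptions: drop_firefox_user_agent and build_debug_url are hypothetical helper names, nothing is actually sent, and the sketch does not explain why the header was rejected. It shows the User-Agent request header being dropped while the same string still travels in the 'ua' payload parameter, URL-encoded the way it appears in the debug URL above.

```python
from urllib.parse import urlencode


def drop_firefox_user_agent(headers):
    """Return a copy of the headers without 'User-Agent' when it names Firefox."""
    cleaned = dict(headers)
    if "Firefox" in cleaned.get("User-Agent", ""):
        del cleaned["User-Agent"]
    return cleaned


def build_debug_url(params):
    """Build a Measurement Protocol debug URL from a dict of hit parameters."""
    return "https://www.google-analytics.com/debug/collect?" + urlencode(params)


firefox_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0"

# The header is removed, but the 'ua' parameter still carries the string.
headers = drop_firefox_user_agent({"User-Agent": firefox_ua, "Accept": "*/*"})
url = build_debug_url({"v": "1", "t": "event", "ua": firefox_ua})
print(headers)
print(url)
```

Like the PHP strpos() check, the "in" test here is case-sensitive.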
I am running a scraping bot in headless mode. In headless mode the user agent contains the string "headless", so I changed the user agent to avoid that. But the website detects the fake user agent and blocks the bot. How can I prevent this detection?
I am using Selenium with chromedriver.
Please add these options:
# Define both user agents; use the one that matches the platform you report below.
windows_useragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
linux_useragent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("user-agent=#{linux_useragent}")
options.add_argument("--disable-web-security")
options.add_argument("--disable-xss-auditor")
options.add_option("excludeSwitches", ["enable-automation", "load-extension"])
navigator.platform and navigator.userAgent must match.
If the userAgent is for Windows, navigator.platform should be "Win32".
If the userAgent is for Linux, navigator.platform should be "Linux x86_64".
You can set it like this:
platform = {
  windows: "Win32",
  linux: "Linux x86_64"
}
driver.execute_cdp("Page.addScriptToEvaluateOnNewDocument", {
  "source": "
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    }),
    Object.defineProperty(navigator, 'languages', {
      get: () => ['en-US', 'en']
    }),
    Object.defineProperty(navigator, 'platform', {
      get: () => \"#{platform[:linux]}\"
    })"
})
And of course you need to set navigator.webdriver to undefined, which the script above also does.
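The matching rule can also be expressed as a small Python sketch that derives navigator.platform from the chosen user agent and builds the override script as a string, so the two values cannot drift apart. The names platform_for_user_agent and build_override_script are hypothetical, and the substring checks are deliberately simplistic: they cover only the two desktop cases mentioned above. In Selenium's Python bindings, the resulting string could be passed as the "source" of Page.addScriptToEvaluateOnNewDocument; that call is not made here.

```python
import json

# Values taken from the answer above.
PLATFORM_BY_OS = {"windows": "Win32", "linux": "Linux x86_64"}


def platform_for_user_agent(user_agent):
    """Pick the navigator.platform value that matches a desktop user agent."""
    if "Windows NT" in user_agent:
        return PLATFORM_BY_OS["windows"]
    if "Linux x86_64" in user_agent:
        return PLATFORM_BY_OS["linux"]
    raise ValueError("no known platform for this user agent")


def build_override_script(user_agent):
    """Build the JavaScript that hides webdriver and aligns navigator.platform."""
    platform = platform_for_user_agent(user_agent)
    return (
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});\n"
        "Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});\n"
        # json.dumps yields a safely quoted JavaScript string literal.
        f"Object.defineProperty(navigator, 'platform', {{get: () => {json.dumps(platform)}}});"
    )


linux_ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
print(build_override_script(linux_ua))
```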
When I run Cobalt, I can see the user agent in the log:
[0101/000230:INFO:application.cc(690)] User Agent: Mozilla/5.0 (DirectFB; Linux x86_64) Cobalt/4.13031-qa (unlike Gecko) Starboard/1
So where does it come from? Is there a way to change it?
The default user agent is built in the following file, which you can check:
https://cobalt.googlesource.com/cobalt/+/e9b4b99dab6e774b8b6e63add74c352cc5dd395a/src/cobalt/network/user_agent_string_factory.cc
std::string UserAgentStringFactory::CreateUserAgentString() {
  // Cobalt's user agent contains the following sections:
  //   Mozilla/5.0 (ChromiumStylePlatform)
  //   Cobalt/Version.BuildNumber-BuildConfiguration (unlike Gecko)
  //   Starboard/APIVersion,
  //   Device/FirmwareVersion (Brand, Model, ConnectionType)

  //   Mozilla/5.0 (ChromiumStylePlatform)
  std::string user_agent =
      base::StringPrintf("Mozilla/5.0 (%s)", CreatePlatformString().c_str());

  //   Cobalt/Version.BuildNumber-BuildConfiguration (unlike Gecko)
  base::StringAppendF(&user_agent, " Cobalt/%s.%s-%s (unlike Gecko)",
                      COBALT_VERSION, COBALT_BUILD_VERSION_NUMBER,
                      kBuildConfiguration);

  //   Starboard/APIVersion,
  if (!starboard_version_.empty()) {
    base::StringAppendF(&user_agent, " %s", starboard_version_.c_str());
  }

  //   Device/FirmwareVersion (Brand, Model, ConnectionType)
  if (youtube_tv_info_) {
    base::StringAppendF(
        &user_agent, ", %s_%s_%s/%s (%s, %s, %s)",
        youtube_tv_info_->network_operator.value_or("").c_str(),
        CreateDeviceTypeString().c_str(),
        youtube_tv_info_->chipset_model_number.value_or("").c_str(),
        youtube_tv_info_->firmware_version.value_or("").c_str(),
        youtube_tv_info_->brand.c_str(), youtube_tv_info_->model.c_str(),
        CreateConnectionTypeString().c_str());
  }

  return user_agent;
}
If your SbSystemGetDeviceType() is true for SystemDeviceTypeIsTv() (in file user_agent_string_factory_starboard.cc), you can customize the UA by implementing some fields of SbSystemGetProperty() + some SbSystemGet() functions.
This is a typical example:
Mozilla/5.0 (1) Cobalt/11.119147-gold (unlike Gecko) Starboard/8, 2_8_6/5 (3, 4, 7)
where the numbers stand for:
1. kSbSystemPropertyPlatformName
2. kSbSystemPropertyNetworkOperatorName
3. kSbSystemPropertyManufacturerName
4. kSbSystemPropertyModelName
5. kSbSystemPropertyFirmwareVersion
6. kSbSystemPropertyChipsetModelNumber
7. SbSystemGetConnectionType()
8. SbSystemGetDeviceType()
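To see how the fields land in the final string, here is a minimal Python sketch that mirrors the format strings of the C++ function quoted above. It is an illustration only, not Cobalt code: build_cobalt_user_agent and the tv_info keys are hypothetical names. The properties are filled with the placeholder digits from the example, in the order of the eight items listed above.

```python
def build_cobalt_user_agent(platform, version, build_number, build_config,
                            starboard_version="", tv_info=None):
    """Assemble a user agent with the same layout as the C++ format strings."""
    user_agent = f"Mozilla/5.0 ({platform})"
    user_agent += f" Cobalt/{version}.{build_number}-{build_config} (unlike Gecko)"
    if starboard_version:
        user_agent += f" {starboard_version}"
    if tv_info is not None:
        # Same field order as the ", %s_%s_%s/%s (%s, %s, %s)" format string.
        user_agent += (
            ", {network_operator}_{device_type}_{chipset_model_number}"
            "/{firmware_version} ({brand}, {model}, {connection_type})"
        ).format(**tv_info)
    return user_agent


# Placeholder digits in list order: 1 platform, 2 operator, 3 manufacturer,
# 4 model, 5 firmware, 6 chipset, 7 connection type, 8 device type.
example = build_cobalt_user_agent(
    platform="1", version="11", build_number="119147", build_config="gold",
    starboard_version="Starboard/8",
    tv_info={
        "network_operator": "2", "device_type": "8", "chipset_model_number": "6",
        "firmware_version": "5", "brand": "3", "model": "4", "connection_type": "7",
    },
)
print(example)
```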
I want to run tests that change the user agent in the HTTP requests sent from the browser (as the Firefox add-on User Agent Switcher does). I saw that you can do it by adjusting the Firefox profile (http://seleniumhq.org/docs/09_webdriver.html).
Is there a way to do it within a test? Something like the function addCustomRequestHeader(), but one that sets a header rather than adding it.
You could insert a function like this to change the user agent on the fly before you make your HTTP request:
function changeuserAgent() {
    var altuserAgentGetter = function () {
        return "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 <choose your string>";
    };
    if (Object.defineProperty) {
        Object.defineProperty(navigator, "userAgent", {
            get: altuserAgentGetter
        });
    }
    else if (Object.prototype.__defineGetter__) {
        navigator.__defineGetter__("userAgent", altuserAgentGetter);
    }
}
If you're using the Selenium 2 Web Driver in Java, you can create a Firefox profile and set the agent string as a preference in the profile. Then use the profile to create the WebDriver object:
FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("general.useragent.override", "Mozilla/5.0 (iPad; U; CPU OS 4_3 like Mac OS X; de-de) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8F191 Safari/6533.18.5");
WebDriver driver = new FirefoxDriver(profile);
For slightly more information and source code examples, see the Selenium Web Driver documentation for Firefox Driver at http://seleniumhq.org/docs/03_webdriver.html#firefox-driver.