symfony/panther is giving "unknown error: net::ERR_NAME_NOT_RESOLVED (Session info: headless chrome=107.0.5304.87)" error

Please help. I am getting the following error when trying to run the code below.
The code is:
use Symfony\Component\Panther\Client;

$client = Client::createChromeClient(null, [
    '--headless',
    '--no-sandbox',
    '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    '--window-size=1200,1100',
    '--disable-gpu',
], ['port' => 9080, 'request_timeout_in_ms' => 100000]);

$client->request('GET', 'https://www.apple.com');
The error I am getting is:
unknown error: net::ERR_NAME_NOT_RESOLVED
(Session info: headless chrome=107.0.5304.87)
#0 /var/www/html/tests/php/scraping/panther/vendor/php-webdriver/webdriver/lib/Remote/HttpCommandExecutor.php(385): Facebook\WebDriver\Exception\WebDriverException::throwException()
#1 /var/www/html/tests/php/scraping/panther/vendor/php-webdriver/webdriver/lib/Remote/RemoteWebDriver.php(598): Facebook\WebDriver\Remote\HttpCommandExecutor->execute()
#2 /var/www/html/tests/php/scraping/panther/vendor/php-webdriver/webdriver/lib/Remote/RemoteWebDriver.php(257): Facebook\WebDriver\Remote\RemoteWebDriver->execute()
#3 /var/www/html/tests/php/scraping/panther/vendor/symfony/panther/src/Client.php(532): Facebook\WebDriver\Remote\RemoteWebDriver->get()
#4 /var/www/html/tests/php/scraping/panther/vendor/symfony/panther/src/Client.php(276): Symfony\Component\Panther\Client->get()
#5 /var/www/html/tests/php/scraping/panther/index.php(26): Symfony\Component\Panther\Client->request()
#6 {main}
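net::ERR_NAME_NOT_RESOLVED is Chrome reporting that the DNS lookup for www.apple.com failed, so the problem is usually the environment's name resolution rather than Panther itself; the /var/www/html paths suggest a container, where missing DNS configuration is a common cause. A minimal diagnostic sketch (assuming PHP runs in the same environment as Chrome):

<?php
// Hypothetical check: confirm this environment can resolve the host at all.
// gethostbyname() returns the unmodified hostname when resolution fails.
$host = 'www.apple.com';
$ip = gethostbyname($host);
echo $ip === $host
    ? "DNS resolution failed for {$host}\n"
    : "{$host} resolves to {$ip}\n";

If this fails too, fix the container's DNS (or network) configuration before changing any Panther or Chrome options.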

Related

400 bad request error when sending hits with Firefox user-agent to GA Measurement Protocol

I'm sending hits to the GA Measurement Protocol, and some of them do not make it to GA. I've noticed that all of them have one thing in common: the user agent is Firefox, varying only in version and device. Some examples:
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0
Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:103.0) Gecko/20100101 Firefox/103.0
Mozilla/5.0 (Android 10; Mobile; rv:103.0) Gecko/103.0 Firefox/103.0
GA validator is OK with those examples when checking them through the debug mode like this:
https://www.google-analytics.com/debug/collect?v=1&tid=UA-XXXXXXXX-1&t=event&ec=Ecommerce&ea=purchase&pa=purchase&cid=1234567890.1234567890&ni=1&ti=184242&tr=1060&uip=X.X.X.X&ua=Mozilla%2F5.0+%28Windows+NT+10.0%3B+Win64%3B+x64%3B+rv%3A103.0%29+Gecko%2F20100101+Firefox%2F103.0&pr1id=test_1&pr1pr=530&pr1qt=1&pr1ps=1
I get this response:
{
  "hitParsingResult": [ {
    "valid": true,
    "parserMessage": [ ],
    "hit": "/debug/collect?v=1..."
  } ],
  "parserMessage": [ {
    "messageType": "INFO",
    "description": "Found 1 hit in the request."
  } ]
}
BUT in the production setting GA responds to the same requests with a 400 bad request error, without providing any details: "Your client has issued a malformed or illegal request. That’s all we know."
So what might be wrong with Firefox UA?
UPD: I've managed to make this work by unsetting the 'User-Agent' header when it contains 'Firefox'; the corresponding 'ua' parameter in the payload then gets accepted.
// Drop the Firefox User-Agent request header; the 'ua' payload
// parameter is still sent and is accepted on its own.
if (strpos($requestHeaders['User-Agent'], 'Firefox') !== false) {
    unset($requestHeaders['User-Agent']);
}
But it's still unclear what was wrong with such headers in the first place.
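For reference, a minimal sketch of the workaround in context (a hypothetical cURL-based sender; the tracking ID placeholder is kept from the question): the Firefox string travels only in the 'ua' payload parameter, while the HTTP User-Agent header is left to cURL's default.

<?php
// Sketch: send a Measurement Protocol hit with the browser UA in the
// 'ua' payload parameter only, no Firefox User-Agent header on the request.
$payload = http_build_query([
    'v'   => 1,
    'tid' => 'UA-XXXXXXXX-1', // placeholder tracking ID
    'cid' => '1234567890.1234567890',
    't'   => 'event',
    'ec'  => 'Ecommerce',
    'ea'  => 'purchase',
    'ua'  => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0',
]);
$ch = curl_init('https://www.google-analytics.com/collect');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_RESPONSE_CODE); // expect 200 rather than 400
curl_close($ch);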

logstash geoip parsing using apache_log data failed

I am new to Elasticsearch.
I want to use the geoip filter in Logstash on Apache log data.
Apache log data:
"83.149.9.216 - - [17/May/2015:10:05:03 +0000] \"GET /presentations/logstash-monitorama-2013/images/kibana-search.png HTTP/1.1\" 200 203023 \"http://semicomplete.com/presentations/logstash-monitorama-2013/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36\""
logstash.conf
input {
  tcp {
    port => 9900
  }
}
filter {
  grok {
    match => { "message" => "%{IP:clientip}" }
  }
  geoip {
    source => "clientip"
  }
}
output {
  stdout { }
}
and I got the error below:
Pipeline error {:pipeline_id=>"main", :exception=>#<LogStash::ConfigurationError: GeoIP Filter in ECS-Compatiblity mode requires a `target` when `source` is not an `ip` sub-field, eg. [client][ip]>
....
Failed to execute action {:id=>:main, :action_type=>LogStash::ConvergeResult::FailedAction, :message=>"Could not execute action: PipelineAction::Reload<main>, action_result: false", :backtrace=>nil}
Here is my data output:
{
    "@timestamp" => 2022-03-09T09:40:28.652491Z,
    "clientip" => "83.149.9.216",
    "@version" => "1",
    "message" => "83.149.9.216 - - [17/May/2015:10:05:03 +0000] \"GET /presentations/logstash-monitorama-2013/images/kibana-search.png HTTP/1.1\" 200 203023 \"http://semicomplete.com/presentations/logstash-monitorama-2013/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36\"",
    "event" => {
        "original" => "83.149.9.216 - - [17/May/2015:10:05:03 +0000] \"GET /presentations/logstash-monitorama-2013/images/kibana-search.png HTTP/1.1\" 200 203023 \"http://semicomplete.com/presentations/logstash-monitorama-2013/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36\""
    }
}
Could you help me solve this problem? Thanks.
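The error is the geoip filter telling you what it needs: in ECS compatibility mode it requires an explicit `target` unless `source` is an `ip` sub-field such as [client][ip]. A minimal sketch of one way to satisfy that, following the error message's own hint by making grok capture into [client][ip] (alternatively, keep `clientip` and add a `target` option, or use %{COMBINEDAPACHELOG} to parse the full Apache line):

filter {
  grok {
    # Capture the address into the ECS-style [client][ip] sub-field.
    match => { "message" => "%{IP:[client][ip]}" }
  }
  geoip {
    # With an `ip` sub-field as source, no explicit `target` is needed;
    # the geo data lands under [client][geo].
    source => "[client][ip]"
  }
}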

How to prevent fake useragent detection in selenium headless?

I am running a scraping bot in headless mode. As you know, the user agent contains a "HeadlessChrome" token when Chrome runs in headless mode. To avoid that, I changed the user agent. But the website detects this fake user agent and blocks the scraping bot. How can I prevent this detection?
I am using Selenium with chromedriver.
Please add these options:
windows_useragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
linux_useragent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"

options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("user-agent=#{linux_useragent}")
options.add_argument("--disable-web-security")
options.add_argument("--disable-xss-auditor")
options.add_option("excludeSwitches", ["enable-automation", "load-extension"])
navigator.platform and navigator.userAgent should match:
if the user agent is for Windows, navigator.platform should be "Win32";
if the user agent is for Linux, navigator.platform should be "Linux x86_64".
You can set it like this:
platform = {
  windows: "Win32",
  linux: "Linux x86_64"
}

driver.execute_cdp("Page.addScriptToEvaluateOnNewDocument", source: "
  Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined
  });
  Object.defineProperty(navigator, 'languages', {
    get: () => ['en-US', 'en']
  });
  Object.defineProperty(navigator, 'platform', {
    get: () => '#{platform[:linux]}'
  });
")
And of course you need to set navigator.webdriver to undefined, as the first defineProperty call above does.
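A quick way to verify that the overrides took effect (a sketch, assuming driver is an already-created Selenium::WebDriver instance):

# Each value is read back from the page's JavaScript context.
puts driver.execute_script("return navigator.userAgent")  # should not contain "HeadlessChrome"
puts driver.execute_script("return navigator.platform")   # should match the spoofed UA's platform
puts driver.execute_script("return navigator.webdriver")  # should print nothing (undefined)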

Unable to scrape ajax loaded elements on a webpage python

I need to scrape the webpage linked in the code below.
That page has a Cross Reference section that I want to scrape. But when I use Python requests to collect the content of the page with the code below:
import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
The resulting content does not have that Cross Reference part, maybe because it is not loaded. I can scrape the rest of the HTML content, but not the Cross Reference part. When I did the same thing with Selenium it worked fine, which means Selenium is able to find this element after it has loaded.
Can someone guide me on how to get this done using Python requests and BeautifulSoup instead of Selenium?
The data is loaded through JavaScript, but you can extract it with requests, BeautifulSoup and the json module:
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

# The page state is embedded as JSON in the #arrow-state element,
# with a custom HTML escaping that has to be undone before parsing.
t = soup.select_one('#arrow-state').text
t = t.replace('&q;', '"').replace('&g;', ">").replace('&l;', "<").replace('&a;', "&")
data = json.loads(t)

# Find the PdpWrapper component, which holds the product-details placeholder.
d = None
for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
    if item['componentName'] == 'PdpWrapper':
        d = item
        break

if d:
    cross_reference_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
    print(json.dumps(cross_reference_product_tiles, indent=4))
Prints:
[
    {
        "partId": "16571604",
        "partNumber": "CGB3B1X5R1A475M055AC",
        "productDetailUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productDetailShareUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productImage": "https://static5.arrow.com/pdfs/2017/4/18/7/26/14/813/tdk_/manual/010101_lowprofile_pi0402.jpg",
        "manufacturerName": "TDK",
        "productLineTitle": "Capacitor Ceramic Multilayer",
        "productDescription": "Cap Ceramic 4.7uF 10V X5R 20% Pad SMD 0603 85\u00b0C T/R",
        "datasheetUrl": "",
        "lowestPrice": 0.0645,
        "lowestPriceFormatted": "$0.0645",
        "highestPrice": 0.3133,
        "highestPriceFormatted": "$0.3133",
        "stockFormatted": "1,875",
        "stock": 1875,
        "attributes": [],
        "buyingOptionType": "AddToCart",
        "numberOfAttributesToShow": 1,
        "rrClickTrackingUrl": null,
        "pricingDataPopulated": true,
        "sourcePartId": "V72:2272_06586404",
        "sourceCode": "ACNA",
        "packagingType": "Cut Strip",
        "unitOfMeasure": "",
        "isDiscontinued": false,
        "productTileHint": null,
        "tileSize": 1,
        "tileType": "1x1",
        "suplementaryClasses": "u-height"
    },
...and so on.
Selenium alone will be enough to scrape the Cross References section, inducing WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))])
Using XPATH:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='WideSidebarProductList-list']//h4")))])
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
['CGB3B1X5R1A475M055AC', 'CL10A475MP8NNNC', 'GRM185R61A475ME11D', 'C0603C475M8PACTU']
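For completeness, a minimal runnable setup around those one-liners (a sketch; the headless option and the 5-second timeout are assumptions, not part of the original answer):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden")
# Wait until the sidebar list is visible, then read the part numbers.
titles = WebDriverWait(driver, 5).until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))
print([el.get_attribute("innerHTML") for el in titles])
driver.quit()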

Where did the default Useragent of Cobalt come from?

When I run Cobalt, I can see the user agent in the log:
[0101/000230:INFO:application.cc(690)] User Agent: Mozilla/5.0 (DirectFB; Linux x86_64) Cobalt/4.13031-qa (unlike Gecko) Starboard/1
So where does it come from? Is there a way to change it?
The default user agent is set in the following file; you can have a look:
https://cobalt.googlesource.com/cobalt/+/e9b4b99dab6e774b8b6e63add74c352cc5dd395a/src/cobalt/network/user_agent_string_factory.cc
std::string UserAgentStringFactory::CreateUserAgentString() {
  // Cobalt's user agent contains the following sections:
  //   Mozilla/5.0 (ChromiumStylePlatform)
  //   Cobalt/Version.BuildNumber-BuildConfiguration (unlike Gecko)
  //   Starboard/APIVersion,
  //   Device/FirmwareVersion (Brand, Model, ConnectionType)

  // Mozilla/5.0 (ChromiumStylePlatform)
  std::string user_agent =
      base::StringPrintf("Mozilla/5.0 (%s)", CreatePlatformString().c_str());

  // Cobalt/Version.BuildNumber-BuildConfiguration (unlike Gecko)
  base::StringAppendF(&user_agent, " Cobalt/%s.%s-%s (unlike Gecko)",
                      COBALT_VERSION, COBALT_BUILD_VERSION_NUMBER,
                      kBuildConfiguration);

  // Starboard/APIVersion,
  if (!starboard_version_.empty()) {
    base::StringAppendF(&user_agent, " %s", starboard_version_.c_str());
  }

  // Device/FirmwareVersion (Brand, Model, ConnectionType)
  if (youtube_tv_info_) {
    base::StringAppendF(
        &user_agent, ", %s_%s_%s/%s (%s, %s, %s)",
        youtube_tv_info_->network_operator.value_or("").c_str(),
        CreateDeviceTypeString().c_str(),
        youtube_tv_info_->chipset_model_number.value_or("").c_str(),
        youtube_tv_info_->firmware_version.value_or("").c_str(),
        youtube_tv_info_->brand.c_str(), youtube_tv_info_->model.c_str(),
        CreateConnectionTypeString().c_str());
  }

  return user_agent;
}
If your SbSystemGetDeviceType() result satisfies SystemDeviceTypeIsTv() (see user_agent_string_factory_starboard.cc), you can customize the UA by implementing some fields of SbSystemGetProperty() plus some SbSystemGet*() functions.
This is a typical example:
Mozilla/5.0 (1) Cobalt/11.119147-gold (unlike Gecko) Starboard/8, 2_8_6/5 (3, 4, 7)
where each numbered placeholder is filled from the corresponding property or function (matching the format string in the code above: network-operator_device-type_chipset/firmware (brand, model, connection)):
1. kSbSystemPropertyPlatformName
2. kSbSystemPropertyNetworkOperatorName
3. kSbSystemPropertyManufacturerName
4. kSbSystemPropertyModelName
5. kSbSystemPropertyFirmwareVersion
6. kSbSystemPropertyChipsetModelNumber
7. SbSystemGetConnectionType()
8. SbSystemGetDeviceType()
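As an illustration, a minimal sketch of how a port might supply some of these values through SbSystemGetProperty() (the property IDs are real Starboard identifiers; the returned strings and the copy helper are placeholders for illustration):

#include <cstring>

#include "starboard/system.h"

namespace {
// Hypothetical helper: copy |from| into the caller's buffer if it fits.
bool CopyString(const char* from, char* out_value, int value_length) {
  if (static_cast<int>(std::strlen(from)) + 1 > value_length) return false;
  std::strcpy(out_value, from);
  return true;
}
}  // namespace

bool SbSystemGetProperty(SbSystemPropertyId property_id, char* out_value,
                         int value_length) {
  switch (property_id) {
    case kSbSystemPropertyPlatformName:
      return CopyString("Linux x86_64", out_value, value_length);  // placeholder
    case kSbSystemPropertyManufacturerName:
      return CopyString("ExampleBrand", out_value, value_length);  // placeholder
    case kSbSystemPropertyModelName:
      return CopyString("ExampleModel", out_value, value_length);  // placeholder
    default:
      return false;
  }
}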
