I want to get the hostnames of the web sites hosted on an IP address (for example the server behind filehippo.com).
What I tried is:
#!/bin/bash
AGENT='Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1'
page=1
for line in $(cat /tmp/IpList.txt)
do
    REQUEST="http://www.bing.com/search?q=ip%3a${line}&qs=n&pq=ip%3a${line}&sc=0-0&sp=-1&sk=&first=1&FORM=PERE"
    curl "$REQUEST" --user-agent "$AGENT"
    let page=page+10
done
What I want:
I want to page through the search results and collect them. In this case there is only one page, but some of my servers have more than one page of results.
Thank you
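A sketch of the paging the question asks for, assuming Bing's first parameter is the 1-based index of the first result on a page (so page N starts at first=1+10*(N-1)). The build_request helper and the PAGES variable are names invented here, and the actual curl call is left commented out:

```shell
#!/bin/bash
AGENT='Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1'
PAGES=3  # how many result pages to walk per IP (an assumption; adjust as needed)

build_request() {  # $1 = IP address, $2 = index of the first result on the page
  echo "http://www.bing.com/search?q=ip%3a$1&first=$2&FORM=PERE"
}

if [ -f /tmp/IpList.txt ]; then
  while read -r ip; do
    for ((n = 0; n < PAGES; n++)); do
      build_request "$ip" $((1 + n * 10))   # print the URL for page n+1
      # curl "$(build_request "$ip" $((1 + n * 10)))" --user-agent "$AGENT"
    done
  done < /tmp/IpList.txt
fi
```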
This is the code I am trying to run in the scrapy shell to get the headline of the article from dailymail.co.uk:
$ scrapy shell "https://www.dailymail.co.uk/tvshowbiz/article-8257569/Shia-LaBeouf-revealed-heavily-tattoo-torso-goes-shirtless-run-hot-pink-shorts.html"
headline = response.xpath("//div[@id='js-article-text']/h2/text()").extract()
Set a user agent with your request and it should work:
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0" "https://www.dailymail.co.uk/tvshowbiz/article-8257569/Shia-LaBeouf-revealed-heavily-tattoo-torso-goes-shirtless-run-hot-pink-shorts.html"
response.xpath("//div[@id='js-article-text']/h2/text()").extract()
Output :
Shia LaBeouf reveals his heavily tattoo torso as he goes shirtless for a run in hot pink shorts
I am currently trying to make a Wallpaper randomiser.
The rule I have is: take the 9th image in the Google Images results for a randomly selected word and set it as the wallpaper. I am doing it in bash.
But when I wget the Google results page, the hrefs of the links I am after disappear and get replaced (without the -k option they are replaced by a #; with -k they are replaced by something I can't read).
Here is my command:
wget -q -p -k --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0" -e robots=off $address
where $address is:
address="https://www.google.fr/search?q=wallpaper+$word&safe=off&biw=1920&bih=880&tbs=isz:ex,iszw:1920,iszh:1080&tbm=isch&source=lnt"
The link that I want to obtain looks like:
href="/imgres?imgurl="<Paste here an url image>"
I have some new information.
In fact Google seems to build these URLs with JavaScript and other client-side technologies, so I would need a wget-like tool that interprets JavaScript first. Does anyone know of one?
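As an illustration of the extraction step itself, here is a hypothetical sketch that pulls image URLs out of an already-downloaded results page by matching the imgurl= parameter and picking the n-th one. The $html sample and the nth_imgurl name are inventions for this sketch; the real page's markup may differ (and, as the question notes, much of it is built by JavaScript):

```shell
# Sample standing in for the downloaded HTML; assumes links of the
# form href="/imgres?imgurl=<image URL>&...".
html='<a href="/imgres?imgurl=http://img.example/one.jpg&h=1080"></a>
<a href="/imgres?imgurl=http://img.example/two.jpg&h=1080"></a>'

nth_imgurl() {  # $1 = n; HTML on stdin; prints the n-th imgurl value
  grep -o 'imgurl=[^&"]*' | sed -e 's/^imgurl=//' -e "${1}!d"
}

printf '%s\n' "$html" | nth_imgurl 2   # → http://img.example/two.jpg
```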
My xidel command is the following:
xidel "https://www.iec-iab.be/nl/contactgegevens/c360afae-29a4-dd11-96ed-005056bd424d" -e '//div[@class="consulentdetail"]'
This should extract all the data in the divs with class consulentdetail.
Nothing special, I thought, but it won't print anything.
Can anyone help me find my mistake?
//EDIT: When I use the same expression in Firefox it finds the desired tags
The site you are connecting to evidently checks the user agent string and delivers different pages depending on the user agent string it is sent.
If you instruct xidel to send a user agent string impersonating, e.g., Firefox on Windows 10, your query starts to work:
> ./xidel --silent --user-agent="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0" "http://www.iec-iab.be/nl/contactgegevens/c360afae-29a4-dd11-96ed-005056bd424d" -e '//div[@class="consulentdetail"]'
Lidnummer11484 2 N 73
TitelAccountant, Belastingconsulent
TaalNederlands
Accountant sinds4/04/2005
Belastingconsulent sinds4/04/2005
AdresStationsstraat 2419550 HERZELE
Telefoon+32 (53) 41.97.02
Fax+32 (53) 41.97.03
AdresStationsstraat 2419550 HERZELE
Telefoon+32 (53) 41.97.02
Fax+32 (53) 41.97.03
GSM+32 (474) 29.00.67
Websitehttp://abbeloosschinkels.be
E-mail
<!--
document.write("");document.write(decrypt(unescCtrlCh("5yÿÃ^à (pñ_!13!Â[îøû!13!5ãév¦Ãçj|°W"),"Iate1milrve%ster"));document.write("");
-->
As a rule of thumb, when doing Web scraping and getting weird results:
Check the page in a browser with JavaScript disabled.
Send a user agent string simulating a Web browser.
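A minimal dry-run sketch of that checklist, reusing the URL and user agent string from the answer above. The fetch() helper is an invention here: it only echoes the curl invocation rather than executing it, so the sketch runs without network access; drop the echo to perform the real requests and diff the two responses:

```shell
FIREFOX_UA='Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0'
URL='http://www.iec-iab.be/nl/contactgegevens/c360afae-29a4-dd11-96ed-005056bd424d'

fetch() {  # $1 = user agent ("" = curl's default), $2 = URL
  if [ -n "$1" ]; then
    echo curl --silent --user-agent "$1" "$2"
  else
    echo curl --silent "$2"
  fi
}

fetch ""            "$URL"   # the request a plain client would make
fetch "$FIREFOX_UA" "$URL"   # the same request impersonating Firefox
```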
I have a simple terminal shell command here:
curl -v -m 60 -H 'Accept-Language: en-US,en;q=0.8,nb;q=0.6' -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" http://imeidata.net -e http://imeidata.net/iphone/model-check http://imeidata.net/iphone/model-check?sn=C8PKTRF5DTC1
When I run it the first time I get the full result (the page is about 23 KB).
When I run the command a second time I get only a sample page (about 17 KB).
I am still able to visit the website in a browser, so my IP is not blocked; only the cURL requests are denied.
The same thing happens again if I change my IP.
Why do my requests get blocked?
Any solution will be highly appreciated.
Thank you for helping.
I would fire it with "-k" (curl's --insecure flag, which skips TLS certificate verification) at the end of the request.
So fire your request + " -k" like below:
curl -v -m 60 -H 'Accept-Language: en-US,en;q=0.8,nb;q=0.6' -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" http://imeidata.net -e http://imeidata.net/iphone/model-check http://imeidata.net/iphone/model-check?sn=C8PKTRF5DTC1 -k
I'm trying to use cURL to get data from a URL of the form:
http://example.com/site-explorer/get_overview_text_data.php?data_type=refdomains_stats&hash=19a53c6b9aab3917d8bed5554000c7cb
which needs a cookie, so I first store one in a file:
curl -c cookie-jar http://example.com/site-explorer/overview/subdomains/example.com
Trying curl with these values:
curl -b cookie-jar -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" --referer "http://example.com/site-explorer/overview/subdomains/example.com" http://example.com/site-explorer/get_overview_text_data.php?data_type=refdomains_stats&hash=19a53c6b9aab3917d8bed5554000c7cb
There is one problem which leaps out at me: You aren't quoting the URL, which means that characters such as & and ? will be interpreted by the shell instead of getting passed to curl. If you're using a totally static URL, enclose it in single quotes, as in 'http://blah.com/blah/blah...'.
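To make the shell's behavior concrete, here is a toy demonstration with echo standing in for curl (the path and hash are the ones from the question):

```shell
# Unquoted, "&" splits the line: the part before it runs in the background,
# and "hash=..." is executed as a separate command (a variable assignment).
echo http://example.com/site-explorer/get_overview_text_data.php?data_type=refdomains_stats&hash=19a53c6b9aab3917d8bed5554000c7cb
wait  # only ".../get_overview_text_data.php?data_type=refdomains_stats" was echoed

# Single-quoted, the URL reaches the command as one intact argument:
echo 'http://example.com/site-explorer/get_overview_text_data.php?data_type=refdomains_stats&hash=19a53c6b9aab3917d8bed5554000c7cb'
```

The same thing happens in the original command: with the URL unquoted, curl only ever sees everything up to the first &.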