XPath expression returns empty output - xpath

My xidel command is the following:
xidel "https://www.iec-iab.be/nl/contactgegevens/c360afae-29a4-dd11-96ed-005056bd424d" -e '//div[@class="consulentdetail"]'
This should extract all the data in the divs with class consulentdetail.
Nothing special, I thought, but it won't print anything.
Can anyone help me find my mistake?
//EDIT: When I use the same expression in Firefox, it finds the desired tags.

The site you are connecting to evidently checks the User-Agent string and serves different pages depending on which one it receives.
If you instruct xidel to send a User-Agent string impersonating a browser, e.g. Firefox on Windows 10, your query starts to work:
> ./xidel --silent --user-agent="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0" "http://www.iec-iab.be/nl/contactgegevens/c360afae-29a4-dd11-96ed-005056bd424d" -e '//div[@class="consulentdetail"]'
Lidnummer11484 2 N 73
TitelAccountant, Belastingconsulent
TaalNederlands
Accountant sinds4/04/2005
Belastingconsulent sinds4/04/2005
AdresStationsstraat 2419550 HERZELE
Telefoon+32 (53) 41.97.02
Fax+32 (53) 41.97.03
AdresStationsstraat 2419550 HERZELE
Telefoon+32 (53) 41.97.02
Fax+32 (53) 41.97.03
GSM+32 (474) 29.00.67
Websitehttp://abbeloosschinkels.be
E-mail
<!--
document.write("");document.write(decrypt(unescCtrlCh("5yÿÃ^à(pñ_!13!­[îøû!13!5ãév¦Ãçj|°W"),"Iate1milrve%ster"));document.write("");
-->
As a rule of thumb, when doing Web scraping and getting weird results:
Check the page in a browser with Javascript disabled.
Send a user agent string simulating a Web browser.
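Following the second rule, a small wrapper makes it easy to retry any fetch with a browser-like User-Agent. A minimal sketch; the UA string is just one example (any current browser UA works), and the piped xidel invocation in the usage comment is an assumption about how you would consume the output:

```shell
#!/bin/bash
# One example of a desktop-browser User-Agent string (any recent one will do)
FIREFOX_UA='Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0'

# Fetch a URL while pretending to be a desktop browser; extra args go to curl
fetch_as_browser() {
  local url=$1; shift
  curl -s -A "$FIREFOX_UA" "$@" "$url"
}

# usage (sketch): fetch_as_browser 'https://www.iec-iab.be/nl/...' |
#   xidel - -e '//div[@class="consulentdetail"]'
```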

Related

Get a link to an image on Google Images with Wget

I am currently trying to make a wallpaper randomiser.
The rule I have set is to take the 9th image that Google Images returns for a randomly selected word and set it as the wallpaper. I am doing it in bash.
But when I do a wget on a Google results page, the common href for these links disappears and gets replaced (if I don't use the -k option they are replaced by a #; otherwise they are replaced by something that I can't read).
Here is my command:
wget -q -p -k --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0" -e robots=off $address
where $address is:
address="https://www.google.fr/search?q=wallpaper+$word&safe=off&biw=1920&bih=880&tbs=isz:ex,iszw:1920,iszh:1080&tbm=isch&source=lnt"
The link I want to obtain looks like:
href="/imgres?imgurl=<paste an image URL here>"
I have some new information.
In fact, Google seems to change these URLs with JavaScript and other client-side technologies, so I would need a wget-like tool that interprets JavaScript first. Does anyone know of one?
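If the saved page does still contain static result links, the Nth image URL could be pulled out with standard tools. A sketch, assuming the '/imgres?imgurl=' pattern survives in the downloaded HTML; with JavaScript-rendered results (as the edit above suggests) nothing will match:

```shell
#!/bin/bash
# Print the Nth image URL found in a saved results page, assuming the static
# HTML still contains '/imgres?imgurl=...' links (it may not; see above).
nth_image_url() {
  local file=$1 n=$2
  grep -o 'imgres?imgurl=[^"&]*' "$file" \
    | sed 's/^imgres?imgurl=//' \
    | sed -n "${n}p"
}

# usage (sketch): nth_image_url saved_results.html 9
```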

curl to bing web site in bash from list

I want to get the host of a web site, like filehippo.com:
What I tried is:
#!/bin/bash
AGENT='Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1'
page=1
for line in $(cat /tmp/IpList.txt)
do
    REQUEST="http://www.bing.com/search?q=ip%3a${line}&qs=n&pq=ip%3a${line}&sc=0-0&sp=-1&sk=&first=${page}&FORM=PERE"
    curl "$REQUEST" --user-agent "$AGENT"
    let page=page+10
done
What I want:
I want to iterate over the result pages and collect the results. In this case there is only one page, but some of my servers have more than one page.
Thank you
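The pagination the question asks about could be separated from the fetching. A sketch, assuming Bing's first parameter is the 1-based index of the first result on a page and that there are 10 results per page; the page count and IP here are placeholders:

```shell
#!/bin/bash
# Build one Bing search URL per results page for a given IP.
# Assumes 10 results per page; 'first' is the 1-based index of the
# first result shown on that page (1, 11, 21, ...).
build_page_urls() {
  local ip=$1 pages=$2 p first
  for ((p = 0; p < pages; p++)); do
    first=$((p * 10 + 1))
    echo "http://www.bing.com/search?q=ip%3a${ip}&first=${first}&FORM=PERE"
  done
}

# usage (sketch):
# build_page_urls 108.168.208.206 3 | while read -r url; do
#   curl -s --user-agent "$AGENT" "$url"
# done
```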

processing a pcap/dmp file for time-to-live, user-agent, and OS

I'm trying to generate a report of the number of clients/devices behind a given NAT gateway using the techniques discussed in this paper.
Basically I need to write a script which looks for both 'User-Agent' and 'Time to live' at the same time:
grep " User-Agent:" *.txt
grep " Time to live:" *.txt
That is exactly how the lines are formatted in my output files, and I'm happy to take the text through to the end of the line. The two greps work separately, but I haven't been successful in combining them.
My most recent attempts have been:
egrep -w ' User-Agent:'|' Time to live:' ../*.txt
grep ' User-Agent:' ../*.txt' && grep ' Time to live:' ../*.txt
(I've been manually exporting text-format files from Wireshark; if anyone has a suggestion for doing that via a script I would be most grateful, as I have a HUGE number of files to process.)
I looked for a similar thread but didn't find one; if one already exists (as I expect) I apologize, and whether someone supplies a link to assistance or provides it here, I would be most grateful.
EDIT: I thought I should mention that the two phrases I'm looking for sit on lines separated by other data, so a solution would need to find both in an example like this:
User-Agent:
blahblahblah:
halbhalbhalb:
Time to live:
egrep ' User-Agent:| Time to live:' ../*.txt gives me:
desktop:~/Documents/scripts$ ./pcap_ttl_OS_useragent
../scripttextfile1p.txt: Time to live: 128
../scripttextfile1p.txt: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0\r\n
../scripttextfile2p.txt: Time to live: 55
../scripttextfile3p.txt: Time to live: 128
../scripttextfile3p.txt: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0\r\n
egrep ' User-Agent:| Time to live:' ../*.txt
should work.
I don't think the -w is getting you any functionality here.
Also, you want to quote the whole extended regular expression, including the | alternation character, as one string.
Finally, it's not clear whether the leading white space in each field comes from a tab character or a group of spaces. That affects the correct text to put into the search patterns. To check the white-space type, I like to use:
grep 'User-Agent' ../*.txt | head -1 | cat -vet
which will show either
..... User-Agent ....
OR
.....^IUser-Agent .....
The ^I is the representation of a tab character.
IHTH
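Once the matching works, the two lines can be paired per file for the report. A sketch, assuming each exported file follows the Wireshark text layout shown above (leading spaces, first match per file is the one wanted):

```shell
#!/bin/bash
# For each exported capture file, print one tab-separated row:
# filename, first 'Time to live' value, first 'User-Agent' value.
pair_fields() {
  local f ttl ua
  for f in "$@"; do
    ttl=$(grep -m1 'Time to live:' "$f")
    ua=$(grep -m1 'User-Agent:' "$f")
    printf '%s\t%s\t%s\n' "$f" "${ttl##*Time to live: }" "${ua##*User-Agent: }"
  done
}

# usage (sketch): pair_fields ../*.txt
```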

Using cURL to replicate browser request

I'm trying to use cURL to get data from a URL of the form:
http://example.com/site-explorer/get_overview_text_data.php?data_type=refdomains_stats&hash=19a53c6b9aab3917d8bed5554000c7cb
which needs a cookie, so I first store one in a file:
curl -c cookie-jar http://example.com/site-explorer/overview/subdomains/example.com
Trying curl with these values:
curl -b cookie-jar -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" --referer "http://example.com/site-explorer/overview/subdomains/example.com" http://example.com/site-explorer/get_overview_text_data.php?data_type=refdomains_stats&hash=19a53c6b9aab3917d8bed5554000c7cb
There is one problem which leaps out at me: you aren't quoting the URL, which means that characters such as & and ? will be interpreted by the shell instead of being passed to curl. If you're using a totally static URL, enclose it in single quotes, as in 'http://blah.com/blah/blah...'.
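The difference is easy to see without touching the network; the URL below reuses the one from the question:

```shell
#!/bin/bash
# Single quotes stop the shell from treating '&' as "run in background"
# and '?' as a glob character, so curl receives the whole query string.
url='http://example.com/get_overview_text_data.php?data_type=refdomains_stats&hash=19a53c6b9aab3917d8bed5554000c7cb'
printf '%s\n' "$url"
# Unquoted, everything after the first '&' would be cut off the command line.
```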

How to scrape a _private_ google group?

I'd like to scrape the discussion list of a private Google group. It's a multi-page list and I might have to do this again later, so scripting sounds like the way to go.
Since this is a private group, I need to log in to my Google account first.
Unfortunately I can't manage to log in using wget or Ruby Net::HTTP. Surprisingly, Google Groups is not accessible with the ClientLogin interface, so all the code samples out there are useless.
My Ruby script is embedded at the end of the post. The response to the authentication query is a 200 OK, but there are no cookies in the response headers, and the body contains the message "Your browser's cookie functionality is turned off. Please turn it on."
I got the same output with wget; see the bash script at the end of this message.
I don't know how to work around this. Am I missing something? Any ideas?
Thanks in advance.
John
Here is the ruby script:
# a ruby script
require 'net/https'
http = Net::HTTP.new('www.google.com', 443)
http.use_ssl = true
path = '/accounts/ServiceLoginAuth'
email='john@gmail.com'
password='topsecret'
# form inputs from the login page
data = "Email=#{email}&Passwd=#{password}&dsh=7379491738180116079&GALX=irvvmW0Z-zI"
headers = { 'Content-Type' => 'application/x-www-form-urlencoded',
'user-agent' => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/6.0"}
# Post the request and print out the response to retrieve our authentication token
resp, data = http.post(path, data, headers)
puts resp
resp.each {|h, v| puts h+'='+v}
#warning: peer certificate won't be verified in this SSL session
Here is the bash script:
# A bash script for wget
CMD=""
CMD="$CMD --keep-session-cookies --save-cookies cookies.tmp"
CMD="$CMD --no-check-certificate"
CMD="$CMD --post-data='Email=john@gmail.com&Passwd=topsecret&dsh=-8408553335275857936&GALX=irvvmW0Z-zI'"
CMD="$CMD --user-agent='Mozilla'"
CMD="$CMD https://www.google.com/accounts/ServiceLoginAuth"
echo $CMD
wget $CMD
wget --load-cookies="cookies.tmp" http://groups.google.com/group/mygroup/topics?tsc=2
Have you tried Mechanize for Ruby?
The Mechanize library is used for automating interaction with websites; you could log in to Google and browse your private Google group, saving what you need.
Here is an example where Mechanize is used for Gmail scraping.
I did this previously by logging in manually with Firefox and then using Chickenfoot to automate the browsing and scraping.
Found this PHP solution for scraping private Google Groups.
