Grep title of a page which is written with spaces [duplicate] - bash

This question already has answers here: Parsing XML using unix terminal (9 answers).
I am trying to get the meta title of some websites.
Some people write the title like this:
`<title>AllHeart Web INC, IT Services Digital Solutions Technology
</title>
`
`<title>AllHeart Web INC, IT Services Digital Solutions Technology</title>`
`<title>
AllHeart Web INC, IT Services Digital Solutions Technology
</title>`
and some in other ways; my current focus is on the above three.
I wrote a simple command that only captures the second style, but I am not sure how to grep the others:
`curl -s https://allheartweb.com/ | grep -o '<title>.*</title>'`
I also wrote something (pretty bad, I guess) where I can grep the line numbers, like this:
`
% curl -s https://allheartweb.com/ | grep -n '<title>'
7:<title>AllHeart Web INC, IT Services Digital Solutions Technology
% curl -s https://allheartweb.com/ | grep -n '</title>'
8:</title>
`
and then store those numbers and loop over the lines to pull out the title, which I guess is a bad idea.
Is there any way I can get the title in all of these cases?

Try this:
curl -s https://allheartweb.com/ | tr -d '\n' | grep -m 1 -oP '(?<=<title>).+?(?=</title>)'
You can remove the newlines from the HTML via tr because they have no meaning inside the title. The next step returns the first match of the shortest string enclosed in <title> and </title>.
This is quite a simple approach, of course. xmllint would be better, but it isn't available on all platforms by default.
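If xmllint does happen to be installed, something along these lines should pull the title out regardless of how the tags are wrapped (just a sketch, assuming the libxml2 xmllint with HTML support; 2>/dev/null hides the parser warnings):
curl -s https://allheartweb.com/ | xmllint --html --xpath 'normalize-space(//title)' - 2>/dev/null
normalize-space() also trims the stray newlines that a multi-line title leaves behind.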

grep is not a very good tool for matching multiple lines, since it processes input line by line. You can hack around that by turning your incoming text into one line:
curl -s https://allheartweb.com/ | xargs | grep -o -E "<title>.*</title>"
This is probably what you want.

Try this sed:
curl -s https://allheartweb.com/ | sed -n "{/<title>/,/<\/title>/p}"
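If you also want to strip the tags themselves, one variation is to join the matched lines and then cut around the tags (a rough sketch, assuming the page contains a single <title> element):
curl -s https://allheartweb.com/ | sed -n '/<title>/,/<\/title>/p' | tr -d '\n' | sed -e 's/.*<title>//' -e 's/<\/title>.*//'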

Related

How can I scrape data from reddit (in bash)?

I want to scrape titles and dates from http://www.reddit.com/r/movies.json in bash:
wget -q -O - "http://www.reddit.com/r/movies.json" | grep -Po '(?<="title": ").*?(?=",)' | sed 's/\"/"/'
I get the titles, but I don't know how to add the dates. Can someone help?
As the extension suggests, it is a JSON (application/json) file, so grep and sed are poorly suited to working with it, as they are mainly built around regular expressions. If you are allowed to install tools, jq should be handy here. Try using your system package manager to install it; if that succeeds, you should get a pretty-printed version of movies.json by doing
wget -q -O - "http://www.reddit.com/r/movies.json" | jq
and then find where the interesting values are placed, which should let you grab them. See the jq Cheat Sheet for examples of jq usage. If you are limited to already-installed tools, I suggest taking a look at Python's json module.
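For instance, a sketch of what the extraction might look like, assuming the usual reddit listing layout where .data.children[].data holds each post:
wget -q -O - "http://www.reddit.com/r/movies.json" | jq -r '.data.children[].data | "\(.title)\t\(.created_utc)"'
Here created_utc is a Unix epoch timestamp, which you can format further if you need a readable date.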

Does grep support the OR in a group?

I am looking at this question: https://leetcode.com/problems/valid-phone-numbers/
which asks you to use a command to extract valid phone numbers.
I found this command works:
cat file.txt | grep -Eo '^(\([0-9]{3}\) ){1}[0-9]{3}-[0-9]{4}$|^([0-9]{3}-){2}[0-9]{4}$'
while this failed:
cat file.txt | grep -E '(^(\([0-9]{3}\))|^([0-9]{3}-))[0-9]{3}-[0-9]{4}'
I don't know why the second one fails. Is it because grep doesn't support OR in a group?
No, it's because you dropped the space, so space in a phone number will no longer be allowed.
Also, the grouping in your regex seems to be off by a whack or two. What are you actually trying to express?
Finally, you have a useless use of cat -- grep can perfectly well read one or more input files without the help of cat.
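For illustration, the second attempt with the missing space put back, the anchors restored, and cat dropped might look like this (a sketch against the leetcode file.txt format, so adjust as needed):
grep -E '^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$' file.txt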

Get Macbook screen size from terminal/bash

Does anyone know of any possible way to determine or glean this information from the terminal (in order to use in a bash shell script)?
On my Macbook Air, via the GUI I can go to "About this mac" > "Displays" and it tells me:
Built-in Display, 13-inch (1440 x 900)
I can get the screen resolution from the system_profiler command, but not the "13-inch" bit.
I've also tried with ioreg without success. Calculating the screen size from the resolution is not accurate, as this can be changed by the user.
Has anyone managed to achieve this?
I think you can only get the display model name, which holds a reference to the size:
ioreg -lw0 | grep "IODisplayEDID" | sed "/[^<]*</s///" | xxd -p -r | strings -6 | grep '^LSN\|^LP'
will output something like:
LP154WT1-SJE1
which depends on the display manufacturer. But as you can see, the first three digits in this model-name string imply the display size: 154 == 15.4''
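Purely as an illustration, and assuming the manufacturer sticks to that three-digit convention, you could pull the inches out of the model string like this:
model=$(ioreg -lw0 | grep "IODisplayEDID" | sed "/[^<]*</s///" | xxd -p -r | strings -6 | grep '^LSN\|^LP')
digits=$(echo "$model" | sed 's/^[A-Z]*\([0-9][0-9][0-9]\).*/\1/')
echo "${digits:0:2}.${digits:2:1}-inch"   # e.g. 15.4-inch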
EDIT
Found a neat solution but it requires an internet connection:
curl -s http://support-sp.apple.com/sp/product?cc=`system_profiler SPHardwareDataType | awk '/Serial/ {print $4}' | cut -c 9-` |
sed 's|.*<configCode>\(.*\)</configCode>.*|\1|'
hope that helps
The following script:
model=$(system_profiler SPHardwareDataType | \
/usr/bin/perl -MLWP::Simple -MXML::Simple -lane '$c=substr($F[3],8)if/Serial/}{
print XMLin(get(q{http://support-sp.apple.com/sp/product?cc=}.$c))->{configCode}')
echo "$model"
will print for example:
MacBook Pro (13-inch, Mid 2010)
Or the same without perl, but with more command forking:
model=$(curl -s http://support-sp.apple.com/sp/product?cc=$(system_profiler SPHardwareDataType | sed -n '/Serial/s/.*: \(........\)\(.*\)$/\2/p')|sed 's:.*<configCode>\(.*\)</configCode>.*:\1:')
echo "$model"
It is fetched online from apple site by serial number, so you need internet connection.
I've found that there seem to be several different Apple URLs for checking this info. Some of them seem to work for some serial numbers, and others for other machines.
e.g:
https://selfsolve.apple.com/wcResults.do?sn=$Serial&Continue=Continue&num=0
https://selfsolve.apple.com/RegisterProduct.do?productRegister=Y&country=USA&id=$Serial
http://support-sp.apple.com/sp/product?cc=$serial (last 4 digits)
https://selfsolve.apple.com/agreementWarrantyDynamic.do
However, the first two URLs are the ones that seem to work for me. Maybe it's because the machines I'm looking up are in the UK and not the US, or maybe it's due to their age?
Anyway, due to not having much luck with curl on the command line (The Apple sites redirect, sometimes several times to alternative URLs, and the -L option doesn't seem to help), my solution was to bosh together a (rather messy) PHP script that uses PHP cURL to check the serials against both URLs, and then does some regex trickery to report the info I need.
Once on my web server, I can now curl it from the terminal command line and it's bringing back decent results 100% of the time.
I'm a PHP novice so I won't embarrass myself by posting the script up in its current state, but if anyone's interested I'd be happy to tidy it up and share it on here (though admittedly it's a rather long-winded solution to what should be a very simple query).
This info really should be simply made available in system_profiler. As it's available through System Information.app, I can't see a reason why not.
For my bash script under GNU/Linux, I do the following to save the current resolution:
# Resolution Fix
echo `xrandr --current | grep current | awk '{print $8}'` >> /tmp/width
echo `xrandr --current | grep current | awk '{print $10}'` >> /tmp/height
sed -i 's/,//g' /tmp/height
WIDTH=$(cat /tmp/width)
HEIGHT=$(cat /tmp/height)
rm /tmp/width /tmp/height
echo "$WIDTH"'x'"$HEIGHT" >> /tmp/Resolution
Resolution=$(cat /tmp/Resolution)
rm /tmp/Resolution
# Resolution Fix
and the following, in the same script, restores the resolution after exiting from some app/game on some systems.
This executes the command directly:
ResolutionRestore=$(xrandr -s $Resolution)
But if it doesn't execute, call the variable like this to run its content:
$($ResolutionRestore)
Another way you can try is the following, for example:
RESOLUTION=$(xdpyinfo | grep -i dimensions: | sed 's/[^0-9]*pixels.*(.*).*//' | sed 's/[^0-9x]*//')
VRES=$(echo $RESOLUTION | sed 's/.*x//')
HRES=$(echo $RESOLUTION | sed 's/x.*//')
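A more compact variant of the same save/restore idea, without temp files, might look like this (just a sketch, assuming xrandr reports the mode in the usual "current W x H," form):
Resolution=$(xrandr --current | awk '/current/ {print $8 "x" $10}' | tr -d ',')
# ... run the app or game here ...
xrandr -s "$Resolution"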

reverse geocoding in bash

I have a GPS unit which extracts longitude and latitude and outputs it as a Google Maps API link:
http://maps.googleapis.com/maps/api/geocode/xml?latlng=51.601154,-0.404765&sensor=false
From this I'd like to call it via curl and display the "short_name" on line 20,
"short_name" : "Northwood",
so I'd just like to be left with
Northwood
so, something like:
curl -s http://maps.googleapis.com/maps/api/geocode/xml?latlng=latlng=51.601154,-0.404765&sensor=false sed sort_name
Mmmm, this is kind of quick and dirty:
curl -s "http://maps.googleapis.com/maps/api/geocode/json?latlng=40.714224,-73.961452&sensor=false" | grep -B 1 "route" | awk -F'"' '/short_name/ {print $4}'
Bedford Avenue
It looks for the line before the line with "route" in it, then the word "short_name" and then prints the 4th field as detected by using " as the field separator. Really you should use a JSON parser though!
Notes:
This doesn't require you to install anything.
I look for the word "route" in the JSON because you seem to want the road name - you could equally look for anything else you choose.
This isn't a very robust solution as Google may not always give you a route, but I guess other programs/solutions won't work then either!
You can play with my solution by successively removing parts from the right hand end of the pipeline to see what each phase produces.
EDITED
Mmm, you have changed from JSON to XML, I see... well, this parses out what you want, but I note you are now looking for a locality whereas before you were looking for a route or road name? Which do you want?
curl -s "http://maps.googleapis.com/maps/api/geocode/xml?latlng=51.601154,-0.404765&sensor=false" | grep -B1 locality | grep short_name| head -1|sed -e 's/<\/.*//' -e 's/.*>//'
The "grep -B1" looks for the line before the line containing "locality". The "grep short_name" then gets the locality's short name. The "head -1" discards all but the first locality if there are more than one. The "sed" stuff removes the <> XML delimiters.
This isn't text, it's structured JSON. You don't want the value after the colon on line 12, you want the value of short name in the address_component with type 'route' from the result.
You could do this with jsawk or python, but it's easier to get it from XML output with xmlstarlet, which is lighter than python and more available than jsawk. Install xmlstarlet and try:
curl -s 'http://maps.googleapis.com/maps/api/geocode/xml?latlng=40.714224,-73.961452&sensor=false' \
| xmlstarlet sel -t -v '/GeocodeResponse/result/address_component[type="route"]/short_name'
This is much more robust than trying to parse JSON as plaintext.
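If jq is available, a comparable extraction from the JSON endpoint might be (a sketch, assuming the classic Geocoding API response layout with results[].address_components[]):
curl -s 'http://maps.googleapis.com/maps/api/geocode/json?latlng=40.714224,-73.961452&sensor=false' | jq -r '.results[0].address_components[] | select(.types | index("route")) | .short_name'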
The following seems to work, assuming you always have the short_name at line 12:
curl -s 'http://maps.googleapis.com/maps/api/geocode/json?latlng=40.714224,-73.961452&sensor=false' | sed -n -e '12s/^.*: "\([a-zA-Z ]*\)",/\1/p'
or, if you are using the XML API and want to grab the short_name on line 20:
curl -s 'http://maps.googleapis.com/maps/api/geocode/xml?latlng=51.601154,-0.404765&sensor=false' | sed -n -e '19s/<short_name>\([a-zA-Z ]*\)<\/short_name>/\1/p'

Extract .co.uk urls from HTML file

I need to extract .co.uk URLs from a file with lots of entries, some .com, .us, etc. I need only the .co.uk ones. Any way to do that?
PS: I'm learning bash.
edit:
code sample:
32
<tr><td id="Table_td" align="center">23<a name="23"></a></td><td id="Table_td"><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2" size="57" readonly="true" style="border: none"></td>
Note: some repeat.
Important: I need all links, broken or 404 too.
I found this code somewhere on the net:
cat file.html | tr " " "\n" | grep .co.uk
output:
href="http://www.domain1.co.uk/"
value="http://www.domain1.co.uk/"
href="http://www.domain2.co.uk/"
value="http://www.domain2.co.uk/"
I think I'm close.
Thanks!
The following approach uses a real HTML engine to parse your HTML, and will thus be more reliable when faced with CDATA sections or other syntax which is hard to parse:
links -dump http://www.google.co.uk/ -html-numbered-links 1 -anonymous \
| tac \
| sed -e '/^Links:/,$ d' \
-e 's/[0-9]\+.[[:space:]]//' \
| grep '^http://[^/]\+[.]co[.]uk'
It works as follows:
links (a text-based web browser) actually retrieves the site.
Using -dump causes the rendered page to be emitted to stdout.
Using -html-numbered-links requests a numbered table of links.
Using -anonymous tweaks defaults for added security.
tac reverses the output from Links in a line-ordered list
sed -e '/^Links:/,$ d' deletes everything after (pre-reversal, before) the table of links, ensuring that actual page content can't be misparsed
sed -e 's/[0-9]\+.[[:space:]]//' removes the numbered headings from the individual links.
grep '^http://[^/]\+[.]co[.]uk' finds only those links with their host parts ending in .co.uk.
One way using awk:
awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html
output:
http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2
http://www.ultraguia.co.uk/motets.php?pg=2
If you are only interested in unique urls, pipe the output into sort -u
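Another lightweight option is to let grep cut the URLs straight out of the quoted attribute values (a sketch, assuming the URLs sit inside double quotes as in your sample):
grep -Eo 'https?://[^" ]*\.co\.uk[^" ]*' file.html | sort -u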
HTH
Since there is no answer yet, I can provide you with an ugly but robust solution. You can exploit the wget command to grab the URLs in your file. Normally, wget is used to download from those URLs, but by denying wget the time for its DNS lookup, it will not resolve anything and will just print the URLs. You can then grep for those URLs that have .co.uk in them. The whole story becomes:
wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*"
If you want to get rid of the remaining timestamp information on each line, you can pipe the output through sed, as in | sed 's/.*-- //'.
If you do not have wget, then you can get it here
