Make an array of image links contained in one big string

I have a big string (the HTML code of a web page).
The problem is how to parse out the links to images: I want to make an array of all the image links in that web page.
I know how to do this in Java, but I don't know how to parse strings and do string manipulation in shell. I know there are many tricks, and I guess this can be done easily.
In the end I want to get something like this:
#!/bin/bash
BIG_STRING=$(curl -s some_web_page_with_links_to_images.com)
#parse the big string and fill the LINKS variable
# fill this with the links to images somehow (.jpg and .png only)
#after the parsing, LINKS should look like this
LINKS=("www.asd.com/asd1.jpg" "www.asd.com/asd.jpg" "www.asd.com/asd2123.jpg")
#I need the parsing to fill the LINKS variable with the links from the web page
# get the length of the array
tLen=${#LINKS[@]}
for (( i=0; i<tLen; i++ ))
do
echo "${LINKS[$i]}"
done
Thanks for the responses, you saved me days of frustration.

Why not start with the right tool? Parsing HTML is hard, especially with sed. If you have the mojo tool from the Mojolicious project you can do this:
mojo get http://example.com a attr href
And then just check whether each line ends with jpg, png, or whatever.
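For example, to filter those lines down to images and load them into the LINKS array from the question (a minimal sketch, assuming mojo is installed and a bash with mapfile; example.com is a placeholder):
mapfile -t LINKS < <(mojo get http://example.com a attr href | grep -Ei '\.(jpe?g|png)$')
echo "found ${#LINKS[@]} image links"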

It's hard to offer more than approximations. Let's assume all interesting links are in href="" attributes, that there's at most one href attribute per line, and that each link fits on one line (actually, I'm not sure newlines are even allowed inside URLs).
Let's assume your source file is called test.html.
The following should print all links under these assumptions:
sed -n 's/.*\<href="\([^"]*\)".*/\1/p' test.html
To understand how this works, you should know what regular expressions are and have read a tutorial on sed (particularly how the s, substitute, command works).
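Tying this back to the original question, here is a sketch that fills the LINKS array under the same one-href-per-line assumption (the URL is the placeholder from the question):
mapfile -t LINKS < <(curl -s some_web_page_with_links_to_images.com | sed -n 's/.*\<href="\([^"]*\)".*/\1/p' | grep -Ei '\.(jpe?g|png)$')
for link in "${LINKS[@]}"
do
echo "$link"
done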

Related

Scraping specific hyperlinks from a website using bash

I have a website containing several dozen hyperlinks in the following format:
<a href=/news/detail/1/hyperlink>textvalue</a>
I want to get all hyperlinks, and their text values, where the hyperlink begins with /news/detail/1/.
The output should be in the following format:
textvalue
/news/detail/1/hyperlink
First of all, people are going to come in here (possibly talking about someone named Cthulhu) and tell you that awk/regex are not HTML parsers. They are right, and you should give some thought to what they say. Realistically, though, you can very often get away with something like this:
sed -n 's/^.*<a\s\+href\=\([^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
This tells sed to read the file input_file.html, find lines that match the regex, replace them with the sections you specified for the output, and discard everything else. The result will print to the terminal.
This also assumes that the file is formatted such that each instance of <a href=/news/detail/1/hyperlink>textvalue</a> is on a separate line. The regex could easily be modified to accommodate different formatting, if needed.
If all of the links you want happen to start with /news/detail/1/, this will probably work:
sed -n 's/^.*<a\s\+href\=\(\/news\/detail\/1\/[^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
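For example, with the sample line from the question saved as input_file.html:
printf '<a href=/news/detail/1/hyperlink>textvalue</a>\n' > input_file.html
sed -n 's/^.*<a\s\+href\=\(\/news\/detail\/1\/[^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
this prints exactly the requested output:
textvalue
/news/detail/1/hyperlink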

breakable slashes everywhere but URLs

I generate PDF (LaTeX) from reStructuredText using Python Sphinx (1.4.6).
I use narrow table column headers with text like "stuff/misc/other". I need the slashes to be breakable, so the table headers don't overflow into the next column.
The LaTeX solution is to use \BreakableSlash or \slash where necessary. I can use Python code to replace all slashes:
from sphinx.util.texescape import tex_replacements
# \BreakableSlash needs the hyphenat package to be loaded
tex_replacements.append((u'/', u'\\BreakableSlash '))
# tex_replacements.append((u'/', u'\\slash '))
But that will break any URL like http://www.example.com/ into something like
http:\unhbox\voidb@x\penalty\@M\hskip\z@skip/\discretionary{-}{}{}\penalty\@M\hskip\z@skip\unhbox\voidb@x\penalty\@M\hskip\z@skip/\discretionary{-}{}{}\penalty\@M\hskip\z@skipwww.example.com
or
http:/\penalty\exhyphenpenalty/\penalty\exhyphenpenaltywww.example.com
I'd like to use a general solution that works in both cases, where the editor of the documentation can still use normal ReST and doesn't have to worry about LaTeX.
Any idea how to get classic slashes in URLs and breakable slashes everywhere else?
You have not really given data and source code, and have only asked for an idea, so I will take the liberty of only sketching a solution in pseudo code (see the Python sketch after this list):
Split the document into a list of strings at each space using .split()
For each string, check whether it is a URL by comparing its left side to http:// (and maybe also ftp://, https:// or similar prefixes)
Do the replacements, but only in strings which are not URLs
Recombine all strings, including the spaces, using a command such as " ".join(my_list)
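A minimal sketch of that idea (the prefix list and the replacement string are assumptions carried over from this answer, not something tested against Sphinx):
def replace_slashes(text):
    # prefixes that mark a word as a URL and exempt it from replacement
    prefixes = ('http://', 'https://', 'ftp://')
    words = text.split(' ')
    for i, word in enumerate(words):
        if not word.startswith(prefixes):
            # make slashes breakable everywhere else
            words[i] = word.replace('/', '\\BreakableSlash ')
    return ' '.join(words)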
One way to do it might be to write a Transform subclass, and then use app.add_transform() in setup(app) so it runs on every read.
I could use DefaultSubstitutions from transforms.py as a template for my own class.

Use bash to extract data between two regular expressions while keeping the formatting

I have a question about a small piece of code using the awk command; I have not found an answer/solution anywhere.
I am trying to parse an output file and extract all data between the first expression, ATOMIC (inclusive), and the second expression, Bond (exclusive). This data is to be sent to a new file $1_geom. So far I have the following:
`awk '/ATOMIC/{flag=1;next}/Bond lengths in Bohr/{flag=0}flag' $1` >> $1_geom
This script will extract the correct data for me, but there are 2 problems:
The line ATOMIC is not extracted with the data
The data is extracted and appended as a single line. I want the data to retain the formatting from the parsed file (5 columns, a variable number of lines). Is there a different way to append data (other than >>) so that I can keep the formatting?
Any help is appreciated, thank you.
The next is causing the first match to be skipped, which is why the ATOMIC line is missing; take it out if you don't want that.
The backticks by themselves are a shell syntax error (unless your Awk script happens to produce valid shell commands). I'm guessing you have a useless echo or something like that in your actual script which disarms the error, but instead produces the symptoms you describe.
This was part of a csh script, and I did have an "echo" in front of this line. Removing the "echo" makes it work perfectly and addresses the 2 questions that I had.
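Putting both fixes together, the corrected line would look something like this (no backticks, no echo, and no next, so the ATOMIC line itself is kept):
awk '/ATOMIC/{flag=1}/Bond lengths in Bohr/{flag=0}flag' "$1" >> "${1}_geom"
Appending with >> preserves the line structure; the flattening onto one line came from the unquoted command substitution inside the echo.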

bash: how to make a clickable link with query parameters?

I am trying to dump a URL on the terminal that needs to be clickable, and the URL comes with a query parameter. For example:
google='https://www.google.com/search?q='
orgname='foo bar'
gsearch=$google\'$orgname\'
echo "details: $orgname ($gsearch)"
But the problem is that the clickable link omits everything after the q=, i.e. it does not include the string 'foo bar'.
How do I make a clickable link that includes the query (i.e. the whole url in the braces above)?
Please also note that I am adding quotes around the search parameter, since it may contain spaces.
Single quotes are not valid in URLs. Use the URL encoding %27 instead:
google='https://www.google.com/search?q='
orgname='foo'
gsearch=$google%27$orgname%27
echo "details: $orgname ($gsearch)"
Note that it's the terminal and not your script that decides what's considered part of a URL for the purpose of selecting or clicking. The above results in
https://www.google.com/search?q=%27foo%27
which is more clickable in most terminals. The script can't specify the extent of the URL except by expressing it in such a standard way that each individual terminal emulator has a decent chance of recognizing it.
PS: I don't think Google cares about surrounding single quotes.
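If orgname really does contain spaces, as in the question, those have to be percent-encoded as well. A minimal sketch, assuming jq is available to do the encoding:
google='https://www.google.com/search?q='
orgname='foo bar'
encoded=$(printf '%s' "$orgname" | jq -sRr @uri)
gsearch="${google}%27${encoded}%27"
echo "details: $orgname ($gsearch)"
This yields https://www.google.com/search?q=%27foo%20bar%27, which most terminals will treat as a single clickable URL.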

removing <br/> from GET request

I'm using a GET request to get some page data, but I need to strip the break tags from the finished file. Basically, I'm taking the output of the GET request and saving it to a file, but it has hundreds of break tags in it that I need removed. I'm fine with running a batch or VBScript after the file is saved to remove the tags, but I'm not sure how to do that either. So far the only solutions I have seen remove entire lines.
EDIT: This will be deployed to multiple Windows servers, so I would like to keep the requirements as minimal as possible, i.e. commands/software that Windows has by default.
If you're au fait with Python, you could use Beautiful Soup to remove <br /> elements in a fairly robust manner; its documentation describes how to remove elements from the tree.
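A minimal sketch of that approach (the file name page.html is a placeholder; note that Beautiful Soup must be installed, so this does not meet the Windows-defaults-only constraint from the edit):
from bs4 import BeautifulSoup

with open('page.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

# remove every <br> element from the parse tree
for br in soup.find_all('br'):
    br.decompose()

with open('page.html', 'w', encoding='utf-8') as f:
    f.write(str(soup))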
Unless I have misunderstood, you could replace the break tags using the Replace function in VBScript (assumed from the tag). For example:
cleanedText = Replace(rawText, "<br/>", "")
More information on usage can be found here
http://www.w3schools.com/Vbscript/func_replace.asp
It is worth mentioning, though, that the function matches verbatim, so you might have to run it a few times to catch all the common tag variants:
cleanedText = Replace(rawText, "<br/>", "")      ' no spaces
cleanedText = Replace(cleanedText, "<br />", "") ' a space
cleanedText = Replace(cleanedText, "<br>", "")   ' unterminated
