I have a website containing several dozen hyperlinks in the following format:
<a href=/news/detail/1/hyperlink>textvalue</a>
I want to get all hyperlinks, and their text values, where the hyperlink begins with /news/detail/1/.
The output should be in the following format:
textvalue
/news/detail/1/hyperlink
First of all, people are going to come in here (possibly invoking someone named Cthulhu) and tell you that awk/regex are not HTML parsers. They are right, and you should give some thought to what they say. Realistically, though, you can very often get away with something like this:
sed -n 's/^.*<a\s\+href\=\([^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
This tells sed to read the file input_file.html, find lines that match the regex, replace them with the sections you specified for the output, and discard everything else. The result will print to the terminal.
This also assumes that the file is formatted such that each instance of <a href=/news/detail/1/hyperlink>textvalue</a> is on a separate line. The regex could easily be modified to accommodate different formatting, if needed.
If all of the links you want happen to start with /news/detail/1/, this will probably work:
sed -n 's/^.*<a\s\+href\=\(\/news\/detail\/1\/[^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
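For example, if several anchors can share one line, a hedged variation (assuming GNU grep and sed are available) is to isolate each anchor first and then apply essentially the same substitution to every fragment:

grep -o '<a[^>]*>[^<]*</a>' input_file.html | sed -n 's/^<a\s\+href=\(\/news\/detail\/1\/[^>]\+\)>\([^<]\+\)<\/a>$/\2\n\1/p'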
I have a question about a small piece of code using the awk command; I have not found an answer or solution anywhere.
I am trying to parse an output file and extract all data between the first expression, ATOMIC (included), and the second expression, Bond (excluded). This data is to be sent to a new file, $1_geom. So far I have the following:
`awk '/ATOMIC/{flag=1;next}/Bond lengths in Bohr/{flag=0}flag' $1` >> $1_geom
This script will extract the correct data for me, but there are 2 problems:
The line containing ATOMIC is not extracted with the data.
The data is extracted but appended to a single line. I want the data to retain the formatting from the parsed file (5 columns, a variable number of lines); see the attached visual example. Is there a different way to append data (other than >>) so that I can keep the formatting?
Any help is appreciated, thank you.
The next is causing the first match to be skipped; take it out if you don't want that.
The backticks by themselves are a shell syntax error (unless your Awk script happens to produce valid shell commands). I'm guessing you have a useless echo or something like that in your actual script which disarms the error, but instead produces the symptoms you describe.
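Putting those two fixes together, the line would look something like this (a sketch only, keeping your $1 / $1_geom names; the backticks and the leading echo are gone, and next is dropped so the ATOMIC line itself is kept):

awk '/ATOMIC/{flag=1} /Bond lengths in Bohr/{flag=0} flag' $1 >> $1_geom

With the echo gone, the appended data also keeps its original line breaks and columns instead of being collapsed onto one line.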
This was part of a code in a csh script and I did have an "echo" in front of this line. Removing the "echo" makes it work perfectly and addresses the 2 questions that I had.
I need help understanding a weird problem with sed, bash and a while loop.
My data looks like this:
-File 1- CSV
account,hostnames,status,ipaddress,port,user,pass
-File 2- XML - This is a sample record set for two entries under one account
<accountname="account">
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
</accountname>
So far, I can add records from File1 to File2 in between the respective account holders. But if I need to remove records that no longer exist, it does not work well: it wipes out records from other accounts, i.e. it does not delete only between a matched accountname.
I import from File 1 into File 2 with a while loop in my bash program:
-Bash Program excerpts-
# Read File2 into F, line by line
cat File2 | while read F
do
    # extract fields from F into variables
    _vmname="$(echo $F | grep 'cname' | sed 's/<cname="//g' | sed 's/.\{2\}$//g')"
    _account="$(echo $F | grep 'accountname' | sed 's/accountname="//g' | sed 's/.\{2\}$//g')"
    # then compare against File1 and look for stale records that are still in File2
    if grep "$_vmname" File1 ; then
        continue
    else
        # if not matched, delete between the respective accountname
        sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
    fi
done
If I manually declare _vmname and _account and run
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
It removes the stale records from File2. When I let my bash script run, it does not.
I think I have three problems:
Reading the variables _vmname and _account inside the loop makes the file get read numerous times; any better way to do this would be appreciated.
I do not think the sed statement that matches these two patterns and then deletes works the way I want inside a while loop.
I may have a logic problem with my thought chain.
Any pointers, and please no awk, perl, lxml or python for this one.
Thanks!
and please no awk
I appreciate that you want to keep things simple, and I suppose awk seems more complicated than what you're doing. But I'd like to point out that you have, so far, 3 grep and 4 sed invocations per line in the file, and you process the other file N times, once per line. That's O(m*n), using the slowest method on the planet to read a file (a while loop). And it doesn't work.
I may have a logic problem with my thought chain.
I'm afraid we must allow for that possibility!
The right advice is to tackle XML with an XML parser, because XML is not a regular language and so can't be parsed with regular expressions. And that's really what you need here, because your program processes the whole XML document. You're not just plucking out bits and depending on incidental formatting artifacts; you want to add records that aren't there and remove those that "no longer exist". Apparently there is information in the XML document you need to preserve, else you would just produce it from the CSV. A parser would spoon-feed it to you.
The second-best advice is to use awk. I suppose you might try an approach like:
Process the CSV and produce the XML to be inserted.
In awk, first read the new input XML into an array keyed by cname, then process the target XML once. For every cname, consult your array; if you find a match, insert your pre-constructed XML replacement (or modify the "paragraph" accordingly). A rough sketch of this step follows the list.
I'm not sure what the delete criteria are, so I don't know if it can be done in the same pass as step #2. If not, extract the salient information somehow. Maybe print a list of keys from each of the two files, and use comm(1) to produce a to-be-deleted list. Then, similar to step #2, read in that list, and process the XML file one more time. Write anything you delete to stderr so you can keep track of what went missing, and from what lines.
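For what it's worth, here is a rough sketch of step #2 only, not a drop-in for your script. It assumes the CSV has already been turned into a file of fresh <cname> records (hypothetically called new.xml here), that every record starts on a line containing <cname= and ends on a line containing </cname> as in your sample, and that a record's <cname="..."> line is identical in both files:

awk '
    NR == FNR {                                 # first file: freshly generated records
        if (/<cname=/) { key = $0; rec = "" }
        if (key != "") rec = rec $0 "\n"
        if (/<\/cname>/) { keep[key] = rec; key = "" }
        next
    }
    /<cname=/ { key = $0; buf = ""; inrec = 1 }
    inrec {
        buf = buf $0 "\n"
        if (/<\/cname>/) {                      # end of record: emit replacement or original
            printf "%s", ((key in keep) ? keep[key] : buf)
            inrec = 0
        }
        next
    }
    { print }                                   # lines outside <cname> records pass through
' new.xml File2

The deletions would then be handled in a second pass, driven by the comm(1) list described above.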
Any pointers
Whenever you find yourself processing the same file N times for N inputs, you know you're headed for trouble. One of the two inputs is always smaller, and that one can be put in some kind of array. cat file | while read is another warning signal, telling you to use awk or any of a dozen obvious utilities that understand lines of text.
You posted your question on SO two weeks ago. I suspect no one answered it because you warned them away: preemptively saying, in effect, don't tell me to use good tools. I'm only here to suggest that you'll be more comfortable after you take off that straitjacket. Better tools, in this case, are the only right answer.
I have a big string (the HTML code of a web page).
Now the problem is how to parse the links to images.
I want to make an array of all the links to images in that web page.
I know how to do this in Java, but I do not know how to parse strings and do string manipulation in the shell. I know there are many tricks, and I guess this can be done easily.
In the end I want to get something like this:
#!/bin/bash
read BIG_STRING <<< $(curl some_web_page_with_links_to_images.com)
#parse the big string and fill the LINKS variable
# fill this with the links to image somewhow (.jpg and .png only)
#after the parsing the LINKS should look like this
LINKS=("www.asd.com/asd1.jpg" "www.asd.com/asd.jpg" "www.asd.com/asd2123.jpg")
#I need the parsing and to fill the LINKS variable with the links from the web page
# get length of an array
tLen=${#LINKS[@]}
for (( i=0; i<${tLen}; i++ ));
do
echo ${LINKS[$i]}
done
Thanks for the responses, you saved me days of frustration.
Why not start with the right tool? Parsing HTML is hard, especially with sed. If you have the mojo tool from the Mojolicious project you can do this:
mojo get http://example.com a attr href
And then just check whether each line ends with jpg, png, or whatever.
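For example (assuming GNU grep is available), keeping only the image links could be as simple as:

mojo get http://example.com a attr href | grep -iE '\.(jpe?g|png)$'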
It's hard to offer more than approximations. Let's assume all interesting links are in href="..." attributes, that there's at most one href attribute per line, and that each link fits on a single line (actually I'm not sure whether newlines are allowed inside URLs).
Let's assume your sourcefile is called test.html.
The following should print all links under these assumptions:
sed -n 's/.*\<href="\([^"]*\)".*/\1/p' test.html
To understand how this works, you should know what regular expressions are and have read a tutorial on sed (particularly how the s (substitute) command works).
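If you also want only the images, and the LINKS array from the question filled in, one possible follow-up (a sketch, assuming bash 4+ for mapfile and GNU grep) would be:

mapfile -t LINKS < <(sed -n 's/.*\<href="\([^"]*\)".*/\1/p' test.html | grep -iE '\.(jpe?g|png)$')
printf '%s\n' "${LINKS[@]}"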
I'm converting some Polish<->English dictionaries from RTF to HTML. The Polish special characters are coming out fine. But IPA (International Phonetic Alphabet) glyphs get changed to funny things, depending on what program I use for conversion. For example, /ˈbiːrɪ/ comes out as /ÈbiùrI/ or /∪βιρΙ/.
I'd like to correct these documents with a search & replace, but I want to make sure I don't miss any characters and don't want to manually pore over dictionary entries. I'd like to output a list of all unique, NON-ascii characters in a document.
I found this thread:
Find Unique Characters in a File
... and I tried the following two proposals:
sed -e "s/./\0\n/g" inputfile | sort -u
sed -e "s/(.)/\1\n/g" inputfile | sort -u
They both work nicely, and seem to both generate the same output. My problem is that they only output standard ASCII characters, and what I'm looking for is exactly the opposite.
The sed tool looks awesome, but I don't have time to learn it right now (though I intend to later). I'm hoping the solution will be clear to someone who's already mastered this tool, and they can save me a lot of time. [-:
Thanks in advance!
This is not a sed solution but a Python solution. It reads the contents of a file, takes it as UTF-8 and then turns it into a set (thus throwing away duplicates), throws away ASCII characters (0-127), sorts it and then joins it back together again with a blank line between each character:
'\n'.join(sorted(set(unicode(open(inputfile).read(), 'utf-8')) - set(chr(i) for i in xrange(128))))
As something you'd run from the command line if you felt so inclined,
python -c "print '\n'.join(sorted(set(unicode(open('inputfile').read(), 'utf-8')) - set(chr(i) for i in xrange(128))))"
(You could also use ''.join instead of '\n'.join which would list the characters without a newline in between.)
I have an array that looks like this:
Ant run name : Basics of Edumate
Overall result : pass
Ant run took: 4 minutes 13 seconds
--------------------------
Details for all test suits
--------------------------
login : Pass
AddCycleTemplate: Pass
AddCycleTemplate: Pass
AddAcademicYear : Pass
AddAcademicYear : Pass
AddCampus : Pass
Is there an easy way in Ruby to convert this into HTML that keeps the formatting?
If these are the only kinds of lines you will ever see, then you could certainly write a Ruby script that does the following:
First, output a doctype declaration, an opening html tag, a head section, an opening body tag, and an opening table tag with whatever style you like.
Read your Ant output line by line. If the line has a colon in it, split it on the colon and output a table row with two columns (one for each side of the split). If the line does not have a colon, write its text in a single cell with colspan=2, perhaps with a style indicating a large margin-top and margin-bottom, except if it is all dashes, in which case you should ignore it.
Output HTML to close the table and the body.
This is certainly a hack and not a general solution by any means, but hey, if you are writing a tool just for yourself so you can have some pretty little Ant outputs, go for it. This is no more than 20 lines of Ruby. Write it to read from stdin and write to stdout so you can pipe your Ant output to it!
If you don't care about fancy formatting, just wrap the entire thing in <pre></pre> tags and you're set: all newlines and whitespace will be preserved, and the default font is a monospaced one.