I am working on a project using a bash shell script. The idea is to grep a wget retrieved page, in order to pick up a certain paragraph on the web page. The area I would like to copy, usually starts with a
<p><b>
but the paragraph also contains other bits of HTML code, such as anchor tags, that I don't want to be in the output of the grep.
I have tried
cat page.html| grep "<p><b>" >grep.txt
and then I grep the output file, which now contains the paragraph I want
cat grep.txt|grep -v '<p>|<b>|<a>' >grep.txt
but then all it does is clear everything from the file and not read anything. How can I get it to exclude only the HTML code?
I am also trying to follow the links that are in the paragraph that I grep, in order to do the same thing with those pages. Only 2 levels deep, so the main page and then what ever sub page(s) stem from the first paragraph of the main page. I know this is a difficult idea, hopefully I explained well enough to get some help. If you have any ideas, any help is appreciated.
Do you have to do this in bash? It seems to me that Python would lend itself to this problem, in particular a library called Beautiful Soup.
I've used this for parsing HTML in the past and it's the easiest tool I could find. It has good documentation for dealing with html.
Perhaps you could make a standalone python code that extracts the HTML and then echos the string you're after. The python code could then be called from inside your bash script if you have some bash functions you want to perform on the string.
I know this is 7 years old but just posting solution I have with bash
https://api.jquery.com/jquery.grep/
Related
I want to read out and later process a value from a website (Facebook Ads) from a bash script that runs daily. Unfortunately I need to be logged in to get this value:
So far I've figured out how to log into this website on Firefox and save the html file where the value could theoretically be read out:
The only unique identifier in this file is the first instance of "Gesamtausgaben". Is there any way with this information to cut out everything besides "100,10" ?
I'd also be happy for a different kind of way to get this value. And no, I don't have any API access.
I appreciate all ideas.
Thanks,
Patrick
How to Parse HTML (Badly) with PCRE
You can't reliably parse HTML with just regular expressions, so you'll need an XML/HTML or XPATH parser to do this properly. That said, if you have a PCRE-compatible grep then the following will likely work provided the HTML is minified and the class isn't re-used on your page.
$ pcregrep -o 'span class=".*_3df[ij].*>\K[^<]+' foo.html
100,10 €
If your target HTML spreads across multiple lines, or if you have multiple spans with the same classes assigned, then you'll have to do some work to refine the regular expression and differentiate between which matches are important to you. Context lines or subsequent matches may be helpful, but your mileage will definitely vary.
I've searched for this very question on this site, but either they use REGEX ( I have no clue how to use that in BASH, or it's not quite similar enough of a problem so I can use the examples.
Basically, I have an html file that has the info I need in a set of parenthesis.
for example:
Merry Christmas (english)
Feliz Navidad (spanish)
I'm trying to take the data from the html and put it into either a string or echo it out to a filename for comparison.
Any suggestions?
You question is very vague so it's hard to tell what are exactly your requirements, but the following command will find and print all the parentheses from the file:
grep -oP '\([^)]+\)' input.html
I am at the edge of frustration thats why my way of explain myself might be rough.
I am trying to remove a column from a tab delimited text files. I have 10 files with same pattern of names such as (written below), thats why when I use such commands like cut I get frustrated with using tab many times.
So my questions;
1) Is there a easier method of basically copying the file name for overwriting?
cut -f 1,3 oldfile.txt > oldfile.txt
btw when I do the method above, my all lines got erased for some reason.
2) Since I dont know the regular expression well enough, is there a way to go around of awk ? that language seems pretty complicated for me.
3)What is the most memory efficient method for his task? Please tell me a method then I will learn that method even its as hard as learning Chinese.
I am a unix frasturared mediocare. sorry for my bad england.
Best regards,
file name example: KKK12398801_normal_912839.txt
I am getting data from a broken RSS feed that gives me wrong link. I wanted to fix this link so I made this code:
<link.*>(.*)&.*tid(.*)</link>
and the link could be like:
www.somedomain.com/?value=50&burrrdurrrr;tid=120
But the real working link is in this form:
www.somedomain.com/?value=50&tid=120
The thing that I'm asking is if my measure thing looks like this:
[FeedURL]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=[Feed]
StringIndex=2 ;now I only get www.somedomain.com/?value=50
Substitute=#SubstituteFeed#
How am I supposed to concatenate the strings together to complete the url?
I'm guessing rather than &burrrdurrrr;, the link has &, which is how you have to write & in an HTML or XML file.
If that's the case, you just need to set the DecodeCharacterReference option, as described in this handy-looking tutorial. Another option mentioned there is Substitute, which would be able to strip it out even if it really was &burrrdurrrr;.
None of this is a particularly sensible way of dealing with HTML or XML - a much better approach would be a plugin which actually parsed the document structure and let you reference nodes using XPath or CSS rules - but you work with what you've got, I guess. (I've never heard of this "Rainmeter" before, despite its claim to be "the best known and most popular desktop customization program for Windows"; maybe because nobody else calls their program that, instead almost universally using the word "widget"?)
I'm currently using BlueCloth to process Markdown in Ruby and show it as HTML, but in one location I need it as plain text (without some of the Markdown). Is there a way to achieve that?
Is there a markdown-to-plain-text method? Is there an html-to-plain-text method that I could feel the result of BlueCloth?
RedCarpet gem has a Redcarpet::Render::StripDown renderer which "turns Markdown into plaintext".
Copy and modify it to suit your needs.
Or use it like this:
Redcarpet::Markdown.new(Redcarpet::Render::StripDown).render(markdown)
Converting HTML to plain text with Ruby is not a problem, but of course you'll lose all markup. If you only want to get rid of some of the Markdown syntax, it probably won't yield the result you're looking for.
The bottom line is that unrendered Markdown is intended to be used as plain text, therefore converting it to plain text doesn't really make sense. All Ruby implementations that I have seen follow the same interface, which does not offer a way to strip syntax (only including to_html, and text, which returns the original Markdown text).
It's not ruby, but one of the formats Pandoc now writes is 'plain'. Here's some arbitrary markdown:
# My Great Work
## First Section
Here we discuss my difficulties with [Markdown](http://wikipedia.org/Markdown)
## Second Section
We begin with a quote:
> We hold these truths to be self-evident ...
then some code:
#! /usr/bin/bash
That's *all*.
(Not sure how to turn off the syntax highlighting!) Here's the associated 'plain':
My Great Work
=============
First Section
-------------
Here we discuss my difficulties with Markdown
Second Section
--------------
We begin with a quote:
We hold these truths to be self-evident ...
then some code:
#! /usr/bin/bash
That's all.
You can get an idea what it does with the different elements it parses out of documents from the definition of plainify in pandoc/blob/master/src/Text/Pandoc/Writers/Markdown.hs in the Github repository; there is also a tutorial that shows how easy it is to modify the behavior.