I'm looking for a way to rewrite all the links in a curl response so they point through my site.
Let's say my site is example.com, and I make a curl request to site.com.
site.com contains various links:
<a href="...">Something!</a>
<some html>......
<a href="...">Google!</a>
<more html>
<a href="...">Something else</a>
My goal is to prefix every link with example.com/?url={THE URL OF THE LINK} (i.e. my own site).
My current solution uses a regexp to "catch" and process all the links.
This works most of the time, but occasionally I run into invalid HTML that breaks the regex.
The regex has another disadvantage: it can't catch onclick="" actions and other link scenarios.
I've heard of several approaches, such as URL rewriting and reverse proxying. Could any of them achieve my goal?
Thanks.
You should absolutely be able to use regex for that. However, your code will have to be a little more robust to handle inline scripting. Analyze a large sample of anchor attributes to determine all the possible link formats, over and above /href=""/ and /window.location.href/.
You will also have to parse referenced script files to see what the event handlers hold.
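For what it's worth, here's a minimal sketch of that regex approach, assuming PHP (since the page is fetched with curl) and double-quoted href attributes; example.com and site.com are the placeholder hosts from the question, and as noted above this catches only href="" links, not onclick handlers or script files:

<?php
// Fetch the remote page (site.com stands in for the real target;
// a curl handle would work just as well as file_get_contents here).
$html = file_get_contents('http://site.com/');

// Prefix every double-quoted href with example.com/?url=...
$rewritten = preg_replace_callback(
    '/href\s*=\s*"([^"]+)"/i',
    function ($m) {
        return 'href="http://example.com/?url=' . urlencode($m[1]) . '"';
    },
    $html
);

echo $rewritten;

A DOM parser (e.g. PHP's DOMDocument) would survive invalid HTML better than this, which is exactly the failure mode described in the question.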
I want to read out (and later process) a value from a website (Facebook Ads) in a bash script that runs daily. Unfortunately, I need to be logged in to get this value.
So far I've figured out how to log into the site in Firefox and save the HTML file from which the value could theoretically be read out.
The only unique identifier in this file is the first instance of "Gesamtausgaben". Is there any way, with this information, to cut out everything besides "100,10"?
I'd also be happy with a different way of getting this value. And no, I don't have any API access.
I appreciate all ideas.
Thanks,
Patrick
How to Parse HTML (Badly) with PCRE
You can't reliably parse HTML with just regular expressions, so you'll need an XML/HTML or XPath parser to do this properly. That said, if you have a PCRE-compatible grep, then the following will likely work, provided the HTML is minified and the class isn't re-used elsewhere on your page.
$ pcregrep -o 'span class=".*_3df[ij].*>\K[^<]+' foo.html
100,10 €
If your target HTML spreads across multiple lines, or if you have multiple spans with the same classes assigned, then you'll have to do some work to refine the regular expression and differentiate between which matches are important to you. Context lines or subsequent matches may be helpful, but your mileage will definitely vary.
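If you can lean on a real parser instead, a hedged alternative (assuming xmllint from libxml2 is installed, and that the target span's class still contains _3dfi, as in the regex above) is an XPath query:

$ xmllint --html --xpath 'string(//span[contains(@class,"_3dfi")])' foo.html
100,10 €

xmllint will grumble about malformed markup on stderr but usually still produces a result; append 2>/dev/null to hide the noise.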
I am getting data from a broken RSS feed that gives me a wrong link. I wanted to fix the link, so I wrote this regex:
<link.*>(.*)&.*tid(.*)</link>
and the link could be like:
www.somedomain.com/?value=50&burrrdurrrr;tid=120
But the real working link is in this form:
www.somedomain.com/?value=50&tid=120
What I'm asking is: if my measure looks like this:
[FeedURL]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=[Feed]
StringIndex=2 ;now I only get www.somedomain.com/?value=50
Substitute=#SubstituteFeed#
how am I supposed to concatenate the strings together to complete the URL?
I'm guessing that rather than &burrrdurrrr;, the link actually contains &amp;, which is how you have to write & in an HTML or XML file.
If that's the case, you just need to set the DecodeCharacterReference option, as described in this handy-looking tutorial. Another option mentioned there is Substitute, which would be able to strip it out even if it really was &burrrdurrrr;.
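As a hedged sketch (DecodeCharacterReference and Substitute are documented WebParser options, but the Url and RegExp values here are placeholders): capture the whole link text as one StringIndex so there is nothing to concatenate, then decode or substitute the bad entity:

[Feed]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=http://www.somedomain.com/feed
; hypothetical regex capturing the full link text as one group
RegExp="(?siU)<link>(.*)</link>"
; decode &amp; and friends into literal characters
DecodeCharacterReference=1

[FeedURL]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=[Feed]
StringIndex=1
; if the entity really is the literal &burrrdurrrr;, strip it instead
Substitute="&burrrdurrrr;":"&"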
None of this is a particularly sensible way of dealing with HTML or XML - a much better approach would be a plugin which actually parsed the document structure and let you reference nodes using XPath or CSS rules - but you work with what you've got, I guess. (I've never heard of this "Rainmeter" before, despite its claim to be "the best known and most popular desktop customization program for Windows"; maybe because nobody else calls their program that, instead almost universally using the word "widget"?)
I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program (it has more options than the 'crawl web' operator in RapidMiner).
I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I'm trying to scrape have thousands of pages, and I don't want to store them all on my PC. And the web crawler simply lacks critical features, so I'm unable to use it for my purposes. Is there a way I can just make it read the URLs and scrape the XPath matches from each of those URLs?
I've also looked at other tools for extracting HTML from pages, but I've been unable to figure out how they work (or even install them), since I'm not a programmer. RapidMiner, on the other hand, is easy to install and the operator descriptions make sense, but I've been unable to connect them in the right order.
I need some input to keep the motivation going. I would like to know what operator I could use instead of 'process documents from files'. I've looked at 'process documents from web', but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.
Looking forward to your replies.
Web scraping without saving the HTML pages locally is a two-step process in RapidMiner:
Step 1: Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan, with one difference: instead of the Crawl Web operator, use the Process Documents from Web operator. There will be no option to specify an output directory, because the results are loaded into the ExampleSet. The ExampleSet will contain the links matching the crawling rules.
Step 2: Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html, but only from 7:40 onwards, with one difference: put the Extract Information subprocess inside the Process Documents from Web operator created previously. The ExampleSet will then contain the links and the attributes matching the XPath queries.
I have much the same problem as you, and maybe these posts from RapidMiner's forum will help you a little:
http://rapid-i.com/rapidforum/index.php/topic,2753.0.html
and
http://rapid-i.com/rapidforum/index.php?topic=3851.0.html
See ya ;)
I have a website that renders the URL:
/work.php?cat=identity
Normally I would research how to use mod_rewrite, but unfortunately my host (Namesco) uses Zeus rather than Apache, which is unusual. How would I use Zeus's rewrite rules to convert it to:
/work/identity
This is a much cleaner, nicer SEO friendly version. On top of this, I still need the $_GET variable to be active because it requests information about the variable cat from the database.
I've never rewritten URLs before, so I've no idea where to begin. I've attempted the change with this rewrite.script file, which is saved within my web folder:
match URL into $ with ^/work.php?cat=/(.*)
if matched set URL= /work/$
Unfortunately it doesn't work. Can anyone help or perhaps offer an alternative?
I had a quick play with this, and I believe I have proven to myself that the Request Rewriting is not able to manipulate the query element of the URL.
There is a potential solution, but it gets even more ugly!
You could use the "Perl Extensions" of ZWS to achieve this. Essentially, you pass the request to the Perl engine within ZWS, run a script against it, then pass the result back to ZWS.
I am afraid this is a bit beyond my capabilities however! I am a "Zeus Traffic Manager" sort of chap...
Nick
Zeus Rewrite Rules are able to access the query part of a URL string. The key thing you're missing, it looks like, is the 1 after the $ in the output URL; the slash should also be removed (and it's worth escaping the regex metacharacters):
match URL into $ with ^/work.php?cat=/(.*)
if matched set URL= /work/$
should be
match URL into $ with ^/work\.php\?cat=(.*)
if matched set URL= /work/$1
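For the direction the question actually needs (serving the clean /work/identity URL while still populating $_GET['cat']), a hedged sketch in the same rewrite.script syntax would map the pretty path back onto the script; whether ZWS then re-exposes the rewritten query string to PHP is worth verifying against the docs linked below:

match URL into $ with ^/work/(.*)$
if matched set URL= /work.php?cat=$1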
I am wondering if the rewrite rules are available for the query portion of the URI? The docs seem to speak only about the path element.
http://support.zeus.com/zws/docs/2005/12/16/zeus_web_server_4_3_documentation
page 141 seems to be the start of it...
I will attempt to fire up a ZWS VM and test this myself.
Nick
I'd like to strip a URL of its query string using mod_rewrite but retain the values of the query string. For example, I'd like to change:
http://new.app/index.php?lorem=1&ipsum=2
to a nice clean:
http://new.app/
but retain the values of lorem and ipsum, so inside index.php:
$_GET["lorem"]
would still return 1 etc.
This is my first dabble with mod_rewrite so any help is greatly appreciated, and if you could explain exactly how your solution works, I can learn a little for next time too!
Thanks!
As Roland mentioned, you don't seem to understand the way rewriting works. It's typically done using Apache mod_rewrite in .htaccess, which silently rewrites the pretty URLs to the PHP script as /index.php?lorem=1&ipsum=2.
Even Joomla uses .htaccess, except it has a single rewrite rule that passes EVERYTHING to a PHP script, which does the actual routing in PHP.
What you are not understanding is that something still needs to exist in the "pretty" version for the PHP script to pull the value of $_GET["lorem"] from.
So it would be something like http://new.app/lorem/ or http://new.app/section/lorem, which would then (using mod_rewrite in .htaccess) be rewritten TO the PHP script.
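As a hedged sketch of that last mapping (assuming an Apache .htaccess in the web root; the rule and segment layout are illustrative, not the asker's actual scheme):

RewriteEngine On
# Skip real files so assets still load.
RewriteCond %{REQUEST_FILENAME} !-f
# Internally rewrite /1/2 to index.php?lorem=1&ipsum=2
# so $_GET["lorem"] and $_GET["ipsum"] are populated.
RewriteRule ^([^/]+)/([^/]+)/?$ index.php?lorem=$1&ipsum=$2 [L,QSA]

With this in place, a request for http://new.app/1/2 serves index.php with $_GET["lorem"] returning 1, while the browser keeps showing the pretty URL.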
I don't understand exactly what you want. Your first URL is the external form, which the users see and can type into their browsers.
The second form has almost all information stripped, so when you send that to a server, how is the server supposed to know that lorem=1&ipsum=2?
If your question is really
How do I make the URLs in the browser look nice, even if the user is somewhere deep in the website clicking on URLs that carry lots of information?
then there are two solutions:
You can pass the information in small bits to the server and save them all in a session. I don't like that because then the user cannot take the URL, show it to a friend and have him see the same page.
You can have your entire web site in an HTML <frameset> containing only one <frame>. That way, the URL of the top-level window will not change, only the inner URL (which is not displayed by the browser) will.
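A minimal sketch of that second option (note that <frameset> is obsolete in modern HTML, so this is purely illustrative):

<!-- hypothetical top-level page served at http://new.app/ -->
<frameset rows="100%">
  <frame src="index.php?lorem=1&amp;ipsum=2">
</frameset>

The address bar keeps showing http://new.app/ while the inner frame navigates with full query strings; the trade-off is that users can't bookmark or share the inner page's state.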