Search specific text with DOM XPath - xpath

I have been trying to crawl a website's pages and search for specific text using Simple HTML DOM and XPath. I collect all the links from the site, then try to crawl each link and search for the text on every page. The text I want to find is inside an HTML span tag.
But no output is shown.
What is going wrong?
Here is my code:
<?php
include_once("simple_html_dom.php");
set_time_limit(0);
$path='http://www.barringtonsports.com';
$html = file_get_contents($path);
$dom = new DOMDocument();
#$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for($i = 0; $i < $hrefs->length; $i++ ){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$nurl = $path.$url;
$html1 = file_get_contents($nurl);
$dom1 = new DOMDocument();
#$dom1->loadHTML($html1);
$xpath1 = new DOMXPath($dom1);
$name = $xpath1->evaluate("//span[contains(.,'Asics Gel Netburner 15 Netball Shoes')]");
if($name)
echo"text found";
}
?>
I just want to check whether the text "Asics Gel Netburner 15 Netball Shoes" exists on any page of the website www.barringtonsports.com.

You're querying a lot of web pages within a single request, which can take longer than your server allows a script to run when generating a page.
You can run this script from the command line to avoid timeouts, or configure PHP and the web server to give the script more time (you can ask on https://serverfault.com/ how to do this).
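Timeouts aside, two details in the question's code would keep it from printing anything useful even with unlimited time: loadHTML() is commented out, so both documents stay empty, and evaluate() with a node-set expression returns a DOMNodeList object, which is always truthy, so if($name) cannot tell a match from no match. A minimal sketch of the check on an inline sample string (spanContainsText() and the sample markup are made up for illustration; the real pages may be structured differently):

```php
<?php
// Hypothetical helper: true if any <span> in $html contains $needle.
// Assumes $needle contains no double quote, since XPath 1.0 string
// literals cannot escape them.
function spanContainsText(string $html, string $needle): bool
{
    libxml_use_internal_errors(true); // real-world markup is rarely valid
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//span[contains(., "' . $needle . '")]');

    // A DOMNodeList is always truthy, so test its length instead.
    return $nodes !== false && $nodes->length > 0;
}

$sample = '<html><body><span>Asics Gel Netburner 15 Netball Shoes</span></body></html>';
echo spanContainsText($sample, 'Asics Gel Netburner 15 Netball Shoes')
    ? "text found\n"
    : "not found\n";
```

Testing ->length is what makes the result meaningful; the same function can then be called once per fetched page.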

Well, first off, you are mixing Simple HTML DOM and DOMDocument. Just use one or the other. Since this is tagged simple-html-dom, start with this from the command line:
<?php
require_once("./simple_html_dom.php"); # simplehtmldom.sourceforge.net to use manual
$path="http://www.barringtonsports.com";
$html = file_get_html($path);
foreach ($html->find('a') as $anchor) {
$url = $anchor->href;
echo "Found link to " . $url . "\n";
# now see if the link is relative, absolute, or even on another site...
$checkhtml = file_get_html($url);
# now you can parse that link for stuff too.
}
?>
But really, that website has a search form, why not just send it a query instead and read the results?
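If you do crawl the links, the comment in the loop about relative, absolute, and off-site links matters: the question's $path.$url concatenation produces a broken URL whenever the href is already absolute. A rough sketch of one way to sort the links out, assuming the base is the site root and ignoring protocol-relative and "../" style links; resolveSameSiteLink() is a made-up helper, not part of Simple HTML DOM:

```php
<?php
// Hypothetical helper: returns an absolute URL on the same host as $base,
// or null for links that should be skipped.
function resolveSameSiteLink(string $base, string $href): ?string
{
    $href = trim($href);
    // Skip empty links, in-page anchors, and non-HTTP schemes.
    if ($href === '' || $href[0] === '#'
        || preg_match('~^(mailto|javascript|tel):~i', $href)) {
        return null;
    }
    // Absolute link: keep it only if it points at the same host.
    if (preg_match('~^https?://~i', $href)) {
        $sameHost = parse_url($href, PHP_URL_HOST) === parse_url($base, PHP_URL_HOST);
        return $sameHost ? $href : null;
    }
    // Protocol-relative links are skipped in this sketch.
    if (strncmp($href, '//', 2) === 0) {
        return null;
    }
    // Relative link: glue it onto the site root.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

$base = 'http://www.barringtonsports.com';
var_dump(resolveSameSiteLink($base, '/netball/shoes'));
var_dump(resolveSameSiteLink($base, 'http://example.com/elsewhere'));
```

Inside the foreach loop you would call this on $anchor->href and only fetch the page when the result is not null.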

How can I apply an XSS filter to the output of Laravel views?

I know I can use {{{}}} to escape all HTML tags in output text, but I want to escape only unsafe tags, not all of them (for example, I want to be able to use the br tag in the text).
You should definitely implement it yourself. I'm assuming that the tags you want to escape are probably just <script> and <iframe>; however, in my opinion it is more appropriate to remove that content entirely than to keep escaped content on your page for no reason.
You could use regex for simple substitution, something like
$html = preg_replace("/<iframe.*?>/", "", $html);
$html = preg_replace("/<script(.*?)>(.*?)<\/script>/", "", $html);
However, this is considered bad practice: no regex reliably covers every way markup can be written, so it can leave a hole in your security.
A better idea is to use PHP's DOMDocument parser. You can do something like this to remove script tags:
$doc = new DOMDocument();
$doc->loadHTML($html);
$script_tags = $doc->getElementsByTagName('script');
// The node list is live, so remove from the end to avoid skipping elements
for ($i = $script_tags->length - 1; $i >= 0; $i--) {
    $script = $script_tags->item($i);
    $script->parentNode->removeChild($script);
}
$clean_html = $doc->saveHTML();
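If the goal is the one from the question (escape everything except a harmless tag such as br), another option is to escape the whole string first and then restore only the exact tag forms you want to allow. A minimal sketch, assuming plain attribute-free br tags are the only markup to keep; escapeExceptBr() is a made-up name, not a Laravel function:

```php
<?php
// Hypothetical helper: escapes all HTML, then re-enables bare <br> tags only.
function escapeExceptBr(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Restore only the exact, attribute-free spellings of <br>.
    return str_replace(
        ['&lt;br&gt;', '&lt;br/&gt;', '&lt;br /&gt;'],
        '<br>',
        $escaped
    );
}

echo escapeExceptBr('Hello<br><script>alert(1)</script>'), "\n";
```

The result has to be printed with Blade's unescaped output syntax for your Laravel version, otherwise it gets escaped a second time.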

Append variables to URL after pagination's '/page/x'

I'm using a couple URI variables to handle sorting a table, like this
.../page/7/sortby/serial_number/orderby/desc
As you can see, I'm also using the built-in CI pagination library. My problem right now is that the links created with $this->pagination->create_links(); strip the sorting variables from the URI, making it difficult to maintain these sorting options between pages.
How can I go about appending these variables sortby/foo/orderby/bar to the URI of links created by the pagination library?
You can use the base_url option, and the page number segments will have to be last. It's a little annoying, but I think it's the simplest way.
// Get the current url segments
$segments = $this->uri->uri_to_assoc();
// Unset the "page" segment so it's not there twice
unset($segments['page']);
// Put the uri back together
$uri = $this->uri->assoc_to_uri($segments);
$config['base_url'] = 'controller/method/'.$uri.'/page/';
// other config here
$this->pagination->initialize($config);
I found the answer thanks to WesleyMurch pointing me in the right direction. To always have the page variable last in the URI (which is necessary when using CI's pagination library), I used this:
$totalseg = $this->uri->total_segments();
$config['uri_segment'] = $totalseg;
then following WesleyMurch's idea, I rebuilt the base_url,
$segments = $this->uri->uri_to_assoc();
unset($segments['page']); //so page doesn't show up twice
$uri = $this->uri->assoc_to_uri($segments);
$config['base_url'] = site_url()."/controller/method/".$uri."/page/";
and of course initialize the pagination with all the correct config options
$this->pagination->initialize($config);
I used ejfrancis's answer, but...
If for some reason the user puts a non-numeric or negative number in the URL's page variable, I suggest validating it before setting $config['uri_segment'], like this:
$totalseg = (is_numeric($this->uri->segment($totalseg)) &&
$this->uri->segment($totalseg) > 0) ?
$totalseg : NULL;
I hope it helps!
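To see what the segment shuffling in the answers above amounts to, here is a framework-free sketch of the same idea: pair the segments up, drop the page pair, and put "page" back at the end. paginationBaseUrl() is a made-up function that only imitates the uri_to_assoc()/assoc_to_uri() round trip on a plain array; it is not part of CodeIgniter.

```php
<?php
// Hypothetical stand-in for the uri_to_assoc() / assoc_to_uri() round trip.
function paginationBaseUrl(string $prefix, array $segments): string
{
    // Pair up segments as key => value (page => 7, sortby => serial_number, ...).
    $assoc = [];
    for ($i = 0; $i + 1 < count($segments); $i += 2) {
        $assoc[$segments[$i]] = $segments[$i + 1];
    }

    // Drop the page pair so it does not show up twice.
    unset($assoc['page']);

    // Flatten back to segments and put "page" last.
    $parts = [];
    foreach ($assoc as $key => $value) {
        $parts[] = $key;
        $parts[] = $value;
    }
    $parts[] = 'page';

    return rtrim($prefix, '/') . '/' . implode('/', $parts) . '/';
}

echo paginationBaseUrl(
    'controller/method',
    ['page', '7', 'sortby', 'serial_number', 'orderby', 'desc']
), "\n";
// prints controller/method/sortby/serial_number/orderby/desc/page/
```

The pagination library then appends the page number after the trailing slash, which is why the page segment has to come last.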

Rewrite rules in the .htaccess file

The request is simple, however, I cannot find a way to implement it. I have links like:
http://mysite.com/index.php?lang=EN
http://mysite.com/index.php?route=add&lang=EN
http://mysite.com/index.php?route=view&lang=EN
and so on. What I want is to create 301 redirects so that EN could be changed to GB. For example, if a customer opens http://mysite.com/index.php?route=add&lang=EN, he should be redirected to http://mysite.com/index.php?route=add&lang=GB.
I have searched for this for days and have failed to find a working solution. Please help.
Does it have to be done in .htaccess? Here's a relatively simple way of doing it in PHP:
<?php
if (isset($_GET['lang']) && "EN" == $_GET['lang']) {
    $params = $_GET;
    $params['lang'] = "GB";
    // http_build_query() URL-encodes each key and value and joins them with "&"
    $query = http_build_query($params, '', '&');
    header("HTTP/1.1 301 Moved Permanently");
    // Redirect to the same script with the updated query string
    header("Location: http://mysite.com/index.php?" . $query);
    exit;
}
The bottom line is that this may be easier to fix at a level where you can isolate each query parameter, look at just the lang parameter, and decide whether to redirect.
With regular expressions (as you would need to use in .htaccess), it's harder to isolate just the lang part. You would also need one line per language you want to redirect, and you would have to maintain that list.
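On the PHP side, the per-language list can live in a single array. A small sketch, assuming a hypothetical LANG_REDIRECTS map and a made-up redirectQuery() helper that returns the new query string, or null when no redirect is needed:

```php
<?php
// Hypothetical mapping of old language codes to new ones.
const LANG_REDIRECTS = ['EN' => 'GB'];

// Returns the rewritten query string, or null if lang needs no redirect.
function redirectQuery(array $query, array $map = LANG_REDIRECTS): ?string
{
    $lang = $query['lang'] ?? null;
    if (!is_string($lang) || !isset($map[$lang])) {
        return null;
    }
    $query['lang'] = $map[$lang];

    // Encodes keys and values and joins the pairs with "&".
    return http_build_query($query, '', '&');
}

var_dump(redirectQuery(['route' => 'add', 'lang' => 'EN']));
var_dump(redirectQuery(['route' => 'add', 'lang' => 'GB']));
```

With $_GET passed in as $query, a non-null result is what would go after the "?" in the Location header.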

How to extract a string from the following HTML page using PHP

I got stuck on something. In short, I need to extract certain data from a web page.
Basically, I need to extract /title/tt0118615/ from the href of the "Anaconda" link,
using preg_match() or any other way. That link is part of the page fetched by the PHP code below:
<?php
$url = "http://www.imdb.com/find?s=tt&q=Anaconda";
$raw = file_get_contents($url);
echo preg_match ("/^(href=\"\/title\/tt)\"$/", $raw, $data);
echo "data: $data[1]";
?>
I know I'm wrong at the pattern, so that's why I'm posting my question here.
Thanks in advance.
I think this pattern will work in your case:
preg_match("/a href=\"([^\"]*)\"/", $raw, $data);
$data will be an array containing your results; $data[1] is the one you're looking for.
$url = "http://www.imdb.com/find?s=tt&q=Anaconda";
$raw = file_get_contents($url);
preg_match_all('%b\.gif\?link=(/title/.*?)\'%i', $raw, $imdbcode, PREG_PATTERN_ORDER);
$imdbcode = $imdbcode[1][0];
echo $imdbcode; # prints /title/tt0118615/
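Both patterns above depend on the exact markup of the page. An alternative that is less sensitive to attribute order and quoting is to parse the HTML and ask for the first link whose href starts with /title/tt. A minimal sketch on an inline sample string; firstTitlePath() is a made-up helper and the sample markup is an assumption, not a copy of the real IMDb page:

```php
<?php
// Hypothetical helper: returns the first /title/tt<digits>/ path found in a link.
function firstTitlePath(string $html): ?string
{
    libxml_use_internal_errors(true); // tolerate invalid real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $links = $xpath->query('//a[starts-with(@href, "/title/tt")]');
    if ($links === false || $links->length === 0) {
        return null;
    }

    // Keep only the /title/tt<digits>/ part, dropping any query string.
    $href = $links->item(0)->getAttribute('href');
    return preg_match('~^/title/tt\d+/~', $href, $m) ? $m[0] : null;
}

$sample = '<p><a href="/title/tt0118615/?ref_=fn_tt">Anaconda</a></p>';
echo firstTitlePath($sample), "\n"; // prints /title/tt0118615/
```

For the live page you would pass in the $raw string from file_get_contents() instead of $sample.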

How to create a "read more" function like WordPress in CodeIgniter

I have no idea how WordPress uses <!--more--> to separate the post and then create a "read more" link.
Any ideas?
Thanks
Use the word_limiter() function from the Text Helper included in CodeIgniter to shorten your post to a fixed number of words, then append the "read more" hyperlink to that text and echo the result.
Text Helper Reference
Take a look at the WP source: the function is in wp-includes/post-template.php, around line 200, in the get_the_content function.
I wouldn't recommend just copying and pasting, as it likely won't work, but you may get the logic behind it. WP uses preg_match for the <!--more--> tag, then parses it if it exists:
$content = $pages[$page-1];
if ( preg_match('/<!--more(.*?)?-->/', $content, $matches) ) {
$content = explode($matches[0], $content, 2);
if ( !empty($matches[1]) && !empty($more_link_text) )
$more_link_text = strip_tags(wp_kses_no_null(trim($matches[1])));
$hasTeaser = true;
} else {
// so on
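Outside WordPress, the same logic can be reduced to a few lines of plain PHP that you could call from a CodeIgniter view or helper. This is only a sketch of the idea, not the WordPress implementation: splitAtMore() and renderTeaser() are made-up names, and the link markup is an assumption.

```php
<?php
// Hypothetical helper: splits a post at the first <!--more--> marker.
function splitAtMore(string $content): array
{
    if (preg_match('/<!--more(.*?)?-->/', $content, $matches)) {
        // Limit 2 keeps everything after the first marker in one piece.
        [$teaser, $rest] = explode($matches[0], $content, 2);
        return ['teaser' => trim($teaser), 'rest' => trim($rest), 'has_more' => true];
    }
    return ['teaser' => $content, 'rest' => '', 'has_more' => false];
}

// Hypothetical helper: teaser plus a "read more" link when a marker exists.
function renderTeaser(string $content, string $url, string $linkText = 'Read more'): string
{
    $parts = splitAtMore($content);
    if (!$parts['has_more']) {
        return $parts['teaser'];
    }
    return $parts['teaser']
        . ' <a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">'
        . htmlspecialchars($linkText, ENT_QUOTES, 'UTF-8') . '</a>';
}

echo renderTeaser('Intro text.<!--more-->Full body.', '/posts/1'), "\n";
// prints Intro text. <a href="/posts/1">Read more</a>
```

Posts without the marker are returned unchanged, so the word_limiter() approach could still be applied to them as a fallback.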
