Firefox extension to get Google PageRank and Alexa ranking

I am creating a Mozilla Firefox toolbar to show the PageRank and Alexa ranking of the current website. One way I came across is to use XMLHttpRequest in my JavaScript file to get the information from a PHP page hosted on my website's server.
The PHP class has this function:
function check($page) {
    // Open a socket to the toolbarqueries address, used by Google Toolbar
    $socket = fsockopen("toolbarqueries.google.com", 80, $errno, $errstr, 30);
    // If a connection can be established
    if($socket) {
        // Prepare the request headers
        $out = "GET /tbr?client=navclient-auto&ch=".$this->checkHash($this->createHash($page)).
               "&features=Rank&q=info:".$page."&num=100&filter=0 HTTP/1.1\r\n";
        $out .= "Host: toolbarqueries.google.com\r\n";
        $out .= "User-Agent: Mozilla/4.0 (compatible; GoogleToolbar 2.0.114-big; Windows XP 5.1)\r\n";
        $out .= "Connection: Close\r\n\r\n";
        // Write the request to the socket
        fwrite($socket, $out);
        // Read the response and pick out the rank
        $result = "";
        while(!feof($socket)) {
            $data = fgets($socket, 128);
            $pos = strpos($data, "Rank_");
            if($pos !== false) {
                $pagerank = substr($data, $pos + 9);
                $result .= $pagerank;
            }
        }
        // Close the connection
        fclose($socket);
        // Return the rank!
        return $result;
    }
}
Is there a better way to get page ranks in my custom Firefox toolbar without having to host a PHP service? For reference, these are the hash helpers used by check():
// Create a url hash
function createHash($string) {
    $check1 = $this->stringToNumber($string, 0x1505, 0x21);
    $check2 = $this->stringToNumber($string, 0, 0x1003F);
    $factor = 4;
    $halfFactor = $factor/2;
    $check1 >>= $halfFactor;
    $check1 = (($check1 >> $factor) & 0x3FFFFC0 ) | ($check1 & 0x3F);
    $check1 = (($check1 >> $factor) & 0x3FFC00 )  | ($check1 & 0x3FF);
    $check1 = (($check1 >> $factor) & 0x3C000 )   | ($check1 & 0x3FFF);
    $calc1 = (((($check1 & 0x3C0) << $factor) | ($check1 & 0x3C)) << $halfFactor ) | ($check2 & 0xF0F );
    $calc2 = (((($check1 & 0xFFFFC000) << $factor) | ($check1 & 0x3C00)) << 0xA) | ($check2 & 0xF0F0000 );
    return ($calc1 | $calc2);
}
// Create checksum for hash
function checkHash($hashNumber)
{
    $check = 0;
    $flag = 0;
    $hashString = sprintf('%u', $hashNumber);
    $length = strlen($hashString);
    for ($i = $length - 1; $i >= 0; $i--) {
        $r = $hashString[$i]; // bracket offset syntax; the curly-brace form is removed in PHP 8
        if(1 === ($flag % 2)) {
            $r += $r;
            $r = (int)($r / 10) + ($r % 10);
        }
        $check += $r;
        $flag++;
    }
    $check %= 10;
    if(0 !== $check) {
        $check = 10 - $check;
        if(1 === ($flag % 2)) {
            if(1 === ($check % 2)) {
                $check += 9;
            }
            $check >>= 1;
        }
    }
    return '7'.$check.$hashString;
}
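For context, a hypothetical usage sketch of these methods (the class name is assumed here, and the stringToNumber() helper they call is not shown in the question):
// Assuming check(), createHash(), checkHash() and the missing stringToNumber()
// helper are wrapped in a class named PageRankChecker (name is hypothetical):
$checker = new PageRankChecker();
$rank = $checker->check("http://www.example.com/");   // queries toolbarqueries.google.com
echo "PageRank: " . $rank;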

If your PHP code only makes an HTTP request, then you can just do the same request from the Firefox extension as well:
var request = new XMLHttpRequest();
request.open("GET", "http://toolbarqueries.google.com/tbr?client=navclient-auto&ch=...");
request.setRequestHeader("User-Agent", "Mozilla/4.0 (compatible; GoogleToolbar 2.0.114-big; Windows XP 5.1)");
request.send();
However, you should clarify whether this use of a Google server (particularly masquerading as Google Toolbar) complies with their Terms of Service. Otherwise you might find yourself confronted with legal action or at the very least sudden changes in the way this web service works.
As to the hash function: obviously you can either translate this algorithm to JavaScript (which is pretty straightforward from the look of it) or search around to see whether anybody has done it already. E.g. I found this JS-based hash algorithm implementation (it's a different algorithm that is prefixed with 8 instead of 7, however; note also that this prefix isn't returned by the hash function but is rather part of the URL there).

Related

PHP GD Linear Dodge (Add)

Is there a function in PHP GD for Linear Dodge?
I wrote this function, but I would like to know if a function like this already exists in PHP GD:
function imagecopyAdd (&$dst_im, &$src_im, $dst_x, $dst_y, $src_x, $src_y, $src_w, $src_h)
{
    for($i = 0; $i < $src_w; $i++)
    {
        for($c = 0; $c < $src_h; $c++)
        {
            $rgb1 = imagecolorat($dst_im, ($i+$dst_x), ($c+$dst_y));
            $colors1 = imagecolorsforindex($dst_im, $rgb1);
            $rgb2 = imagecolorat($src_im, ($i+$src_x), ($c+$src_y));
            $colors2 = imagecolorsforindex($src_im, $rgb2);
            $r = $colors1["red"] + $colors2["red"];
            if($r > 255)
                $r = 255;
            $g = $colors1["green"] + $colors2["green"];
            if($g > 255)
                $g = 255;
            $b = $colors1["blue"] + $colors2["blue"];
            if($b > 255)
                $b = 255;
            $color = imagecolorallocate($dst_im, $r, $g, $b);
            imagesetpixel($dst_im, ($i+$dst_x), ($c+$dst_y), $color);
        }
    }
}
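A quick usage sketch for the function above (the file names are placeholders; the overlay is assumed to fit inside the base image):
<?php
// Blend overlay.png additively (linear dodge) onto base.png using imagecopyAdd().
$dst = imagecreatefrompng('base.png');
$src = imagecreatefrompng('overlay.png');
imagecopyAdd($dst, $src, 0, 0, 0, 0, imagesx($src), imagesy($src));
imagepng($dst, 'result.png');
imagedestroy($dst);
imagedestroy($src);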
While my understanding of image manipulation at more than a visual level is rudimentary at best, it seems like this may be built into the options for imagefilter.
See this comment for a user's custom implementation, which he says performs the exact same function as the built-in option:
It turns out PHP's Colorize is the equivalent of Photoshop's "Linear Dodge" layer filter.
Note these lines in his comment:
// Usage: Just as imagefilter(), except with no filtertype.
// imagefilterhue(resource $image, int $red, int $green , int $blue)
imagefilterhue($im,2,70,188);
// The equivalent with colorize, as tested in demo image:
imagefilter($im, IMG_FILTER_COLORIZE, 2, 70, 188);
Where imagefilterhue is his custom implementation and imagefilter is the PHP GD function.

Update query in for loop is not working

I have a query like this inside a for loop in CodeIgniter, but it executes with different values, not with the values coming in through the POST method.
$j = $_POST['hidden'];
$inv_id = $_POST['invoice_id'];
$sum = '';
for($i = 1; $i <= $j; $i++){
    $wh_quantity1 = $_POST['quantity'.$i];
    //print_r($wh_quantity1);
    if($wh_quantity1 == ''){
        $wh_quantity = 0;
    }
    else{
        $wh_quantity = $wh_quantity1;
    }
    $query = "UPDATE tb_warehouse_stocks SET wh_product_qty = wh_product_qty - $wh_quantity WHERE invoice_id = '$inv_id'";
    $this->db->query($query);
    $sum += $wh_quantity;
}
Why is it like that? It always updates with greater values than the POST values.
Try this in case you don't have all the POST indexes:
$j = $this->input->post('hidden');
$inv_id = $this->input->post('invoice_id');
$sum = 0;
for ($i = 1; $i <= $j; $i++) {
    $wh_quantity = (int) $this->input->post('quantity' . $i);
    $sum += $wh_quantity;
}
$query = "UPDATE tb_warehouse_stocks SET wh_product_qty = wh_product_qty - $sum WHERE invoice_id = '$inv_id'";
$this->db->query($query);
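The reason the original code updates with larger values is most likely that the UPDATE ran on every loop iteration against the same invoice_id, so each matching row had every quantity subtracted from it one after another. Summing first and updating once, as above, avoids that. The same idea with CodeIgniter query bindings, as a sketch (table and column names taken from the question):
$sum = 0;
for ($i = 1; $i <= $j; $i++) {
    $sum += (int) $this->input->post('quantity' . $i);
}
$this->db->query(
    "UPDATE tb_warehouse_stocks SET wh_product_qty = wh_product_qty - ? WHERE invoice_id = ?",
    array($sum, $inv_id)
);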

How to read the content of a doc file using PHP or CodeIgniter

private function read_doc($filename) {
    $fileHandle = fopen($filename, "r");
    var_dump($filename);
    $line = fread($fileHandle, filesize($filename));
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE) || (strlen($thisline) == 0)) {
        } else {
            $outtext .= $thisline . " ";
        }
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t#\/\_\(\)]/", "", $outtext);
    return $outtext;
}
I am trying to read the content of a .doc file using the above code, but when I run it, it gives me
filesize(): stat failed for http://localhost/jobportal/public/uploads/document/1.doc and
fread(): Length parameter must be greater than 0
As you can see here http://php.net/manual/en/function.filesize.php :
As of PHP 5.0.0, this function can also be used with some URL wrappers. Refer to Supported Protocols and Wrappers to determine which wrappers support the stat() family of functionality.
and if you go to this link http://www.php.net/manual/en/wrappers.http.php , you will see that "Supports stat()" is "No". So you can't use HTTP links with the filesize function. If this is a local file, just use an absolute or relative path like '/path/to/my/file'; if it is a remote file, I think it should be downloaded first, e.g. with cURL, using curl_getinfo to read the Content-Length HTTP header.
That filesize(): stat failed for http://localhost/jobportal/public/uploads/document/1.doc means you are trying to get the file size of a URL, which isn't supported.
You need to read from the stream in chunks until the end and check how much data has actually been received.
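For example, a minimal sketch of reading the remote file in chunks (assumes allow_url_fopen is enabled; the URL is the one from the error message and is only illustrative, and a local filesystem path avoids the problem entirely):
<?php
$url = 'http://localhost/jobportal/public/uploads/document/1.doc';
$handle = fopen($url, 'rb');
if ($handle === false) {
    die('Unable to open URL');
}
$content = '';
while (!feof($handle)) {
    $content .= fread($handle, 8192);   // read 8 KB at a time until EOF
}
fclose($handle);
echo 'Downloaded ' . strlen($content) . ' bytes';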
if($ext == 'doc') {
    if(file_exists($filename)) {
        if(($fh = fopen($filename, 'r')) !== false) {
            $headers = fread($fh, 0xA00);
            $n1 = ( ord($headers[0x21C]) - 1 );
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );
            $textLength = ($n1 + $n2 + $n3 + $n4);
            $content_of_file1 = fread($fh, $textLength);
            $content_of_file = strtolower($content_of_file1);
        }
    }
}

Designing a web crawler

I have come across an interview question "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.
How does it all begin?
Say Google started with some hub pages, say hundreds of them (how these hub pages were found in the first place is a different sub-question).
As Google follows links from a page and so on, does it keep building a hash table to make sure that it doesn't follow earlier-visited pages?
What if the same page has two names (URLs), as can easily happen these days with URL shorteners etc.?
I have taken Google as an example. Google doesn't disclose how its web crawler algorithms, page ranking, etc. work, but any guesses?
If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern scraper:
In the course of extracting links, any Web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link before adding it to the URL frontier. (An alternative design would be to instead perform the URL-seen test when the URL is removed from the frontier, but this approach would result in a much larger frontier.)
To perform the URL-seen test, we store all of the URLs seen by Mercator in canonical form in a large table called the URL set. Again, there are too many entries for them all to fit in memory, so like the document fingerprint set, the URL set is stored mostly on disk.
To save space, we do not store the textual representation of each URL in the URL set, but rather a fixed-sized checksum. Unlike the fingerprints presented to the content-seen test's document fingerprint set, the stream of URLs tested against the URL set has a non-trivial amount of locality. To reduce the number of operations on the backing disk file, we therefore keep an in-memory cache of popular URLs. The intuition for this cache is that links to some URLs are quite common, so caching the popular ones in memory will lead to a high in-memory hit rate.
In fact, using an in-memory cache of 2^18 entries and the LRU-like clock replacement policy, we achieve an overall hit rate on the in-memory cache of 66.2%, and a hit rate of 9.5% on the table of recently-added URLs, for a net hit rate of 75.7%. Moreover, of the 24.3% of requests that miss in both the cache of popular URLs and the table of recently-added URLs, about 1/3 produce hits on the buffer in our random access file implementation, which also resides in user-space. The net result of all this buffering is that each membership test we perform on the URL set results in an average of 0.16 seek and 0.17 read kernel calls (some fraction of which are served out of the kernel's file system buffers). So, each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set. These savings are purely due to the amount of URL locality (i.e., repetition of popular URLs) inherent in the stream of URLs encountered during a crawl.
Basically, they hash all of the URLs with a hashing function that guarantees unique hashes for each URL, and due to the locality of URLs it becomes very easy to find URLs. Google even open-sourced its hashing function: CityHash
WARNING!
They might also be talking about bot traps!!! A bot trap is a section of a page that keeps generating new links with unique URLs and you will essentially get trapped in an "infinite loop" by following the links that are being served by that page. This is not exactly a loop, because a loop would be the result of visiting the same URL, but it's an infinite chain of URLs which you should avoid crawling.
Update 12/13/2012- the day after the world was supposed to end :)
Per Fr0zenFyr's comment: if one uses the AOPIC algorithm for selecting pages, then it's fairly easy to avoid bot-traps of the infinite loop kind. Here is a summary of how AOPIC works:
1. Get a set of N seed pages.
2. Allocate X amount of credit to each page, such that each page has X/N credit (i.e. an equal amount of credit) before crawling has started.
3. Select a page P, where P has the highest amount of credit (or if all pages have the same amount of credit, crawl a random page).
4. Crawl page P (let's say that P had 100 credits when it was crawled).
5. Extract all the links from page P (let's say there are 10 of them).
6. Set the credits of P to 0.
7. Take a 10% "tax" and allocate it to a Lambda page.
8. Allocate an equal amount of credits to each link found on page P from P's original credit minus the tax: so (100 (P credits) - 10 (10% tax)) / 10 (links) = 9 credits per link.
9. Repeat from step 3.
Since the Lambda page continuously collects tax, eventually it will be the page with the largest amount of credit and we'll have to "crawl" it. I say "crawl" in quotes, because we don't actually make an HTTP request for the Lambda page, we just take its credits and distribute them equally to all of the pages in our database.
Since bot traps only give credits to internal links and rarely get credit from the outside, they will continually leak credits (through taxation) to the Lambda page. The Lambda page distributes those credits evenly to all of the pages in the database, and with each cycle the bot-trap page loses more and more credits, until it has so few that it almost never gets crawled again. This will not happen with good pages, because they often get credits from backlinks found on other pages. This also results in a dynamic page rank: any time you take a snapshot of your database and order the pages by the amount of credit they have, they will most likely be ordered roughly according to their true page rank.
This only avoids bot traps of the infinite-loop kind, but there are many other bot traps you should watch out for, and there are ways to get around them too.
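A compact, illustrative PHP sketch of that credit cycle (the link graph, tax rate and cycle count are invented for the example; a real crawler would fetch each selected page and extract its links instead of using a hard-coded graph):
<?php
$taxRate = 0.10;
$lambda  = 0.0;   // credits accumulated by the virtual Lambda page

// Hypothetical link graph: page => pages it links to.
$links = array(
    'a.com'       => array('b.com', 'c.com'),
    'b.com'       => array('a.com'),
    'c.com'       => array('c.com/trap1'),       // leads into a bot-trap-like loop
    'c.com/trap1' => array('c.com/trap2'),
    'c.com/trap2' => array('c.com/trap1'),
);

// Steps 1-2: seed every page with an equal share of credit.
$credits = array();
foreach (array_keys($links) as $page) {
    $credits[$page] = 100.0 / count($links);
}

for ($cycle = 0; $cycle < 1000; $cycle++) {
    // "Crawl" the Lambda page when it holds the most credit: redistribute evenly.
    if ($lambda >= max($credits)) {
        $share = $lambda / count($credits);
        foreach ($credits as $page => $c) {
            $credits[$page] = $c + $share;
        }
        $lambda = 0.0;
        continue;
    }

    // Step 3: pick the page with the highest credit.
    arsort($credits);
    reset($credits);
    $page   = key($credits);
    $credit = $credits[$page];

    // Steps 4-8: crawl it, zero it, tax it, spread the rest over its out-links.
    $credits[$page] = 0.0;
    $lambda += $credit * $taxRate;
    $out = isset($links[$page]) ? $links[$page] : array();
    if (count($out) > 0) {
        $share = ($credit * (1 - $taxRate)) / count($out);
        foreach ($out as $target) {
            if (!isset($credits[$target])) {
                $credits[$target] = 0.0;
            }
            $credits[$target] += $share;
        }
    }
}

// Ordering pages by credit now approximates their importance; trap pages starve.
arsort($credits);
print_r($credits);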
While everybody here has already suggested how to create your web crawler, here is how Google ranks pages.
Google gives each page a rank based on the number of backlinks (how many links on other websites point to a specific website/page). This is called a relevance score. It is based on the fact that if many other pages link to a page, it is probably an important page.
Each site/page is viewed as a node in a graph. Links to other pages are directed edges. The in-degree of a vertex is defined as the number of incoming edges. Nodes with a higher number of incoming edges are ranked higher.
Here's how the PageRank is determined. Suppose that page Pj has Lj links. If one of those links is to page Pi, then Pj will pass on 1/Lj of its importance to Pi. The importance ranking of Pi is then the sum of all the contributions made by pages linking to it. So if we denote the set of pages linking to Pi by Bi, then we have this formula:
Importance(Pi) = sum( Importance(Pj)/Lj ) over all pages Pj in Bi
The ranks are placed in a matrix called the hyperlink matrix: H[i,j]
An entry in this matrix is either 0, or 1/Lj if there is a link from Pj to Pi. Another property of this matrix is that if we sum all the entries in a column we get 1.
Now we need to find an eigenvector of this matrix, named I (with eigenvalue 1), such that:
I = H*I
Now we start iterating: I, H*I, H*(H*I), ..., H^k * I, until the solution converges, i.e. we get pretty much the same numbers in the vector at step k and step k+1.
Now whatever is left in the I vector is the importance of each page.
For a simple class homework example see http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
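To make the iteration concrete, here is a tiny power-iteration sketch in PHP for a made-up three-page link graph (no damping factor or dangling-node handling, just the bare I = H*I iteration described above):
<?php
// $links[$j] = pages that page $j links to.
$links = array(
    'A' => array('B', 'C'),
    'B' => array('C'),
    'C' => array('A'),
);
$pages = array_keys($links);
$n = count($pages);

// Hyperlink matrix H: H[i][j] = 1/Lj if page j links to page i, otherwise 0.
$H = array();
foreach ($pages as $i) {
    foreach ($pages as $j) {
        $H[$i][$j] = in_array($i, $links[$j]) ? 1.0 / count($links[$j]) : 0.0;
    }
}

// Start with a uniform importance vector I and iterate I = H * I.
$I = array_fill_keys($pages, 1.0 / $n);
for ($k = 0; $k < 50; $k++) {
    $next = array_fill_keys($pages, 0.0);
    foreach ($pages as $i) {
        foreach ($pages as $j) {
            $next[$i] += $H[$i][$j] * $I[$j];
        }
    }
    $I = $next;
}

arsort($I);
print_r($I);   // pages ordered by their converged importance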
As for solving the duplicate issue in your interview question, do a checksum on the entire page and use either that or a hash of the checksum as your key in a map to keep track of visited pages.
It depends on how deep their question was intended to be. If they were just trying to avoid following the same links back and forth, then hashing the URLs would be sufficient.
What about content that has literally thousands of URLs leading to the same content? Like a query-string parameter that doesn't affect anything but can have an infinite number of iterations. I suppose you could hash the contents of the page as well and compare URLs to see if they are similar, to catch content that is identified by multiple URLs. See, for example, the bot traps mentioned in @Lirik's post.
You'd have to have some sort of hash table to store the results in; you'd just have to check it before each page load.
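A small sketch of such a URL-seen check: canonicalize the URL, take a fixed-size checksum of it, and keep the checksums in a set (the canonicalization rules below are deliberately minimal examples; real crawlers normalize far more aggressively):
<?php
function canonicalize($url) {
    $parts = parse_url(strtolower(trim($url)));
    $host  = isset($parts['host'])  ? $parts['host'] : '';
    $path  = isset($parts['path'])  ? rtrim($parts['path'], '/') : '';
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';
    return $host . $path . $query;        // scheme and fragment are ignored
}

function isNewUrl($url, array &$seen) {
    $key = md5(canonicalize($url));       // fixed-size checksum, as in the Mercator paper
    if (isset($seen[$key])) {
        return false;
    }
    $seen[$key] = true;
    return true;
}

$seen = array();
var_dump(isNewUrl('http://example.com/page/', $seen));   // true
var_dump(isNewUrl('HTTP://EXAMPLE.COM/page', $seen));    // false: same canonical form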
The problem here is not to crawl duplicated URLs, which is solved by an index using a hash obtained from the URLs. The problem is to crawl DUPLICATED CONTENT. Each URL of a "crawler trap" is different (year, day, SessionID...).
There is no "perfect" solution... but you can use some of these strategies:
• Keep a field recording at which level the URL sits inside the website. For each cycle of getting URLs from a page, increase the level. It will be like a tree. You can stop crawling at a certain level, like 10 (I think Google uses this).
• You can try to create a kind of HASH which can be compared to find similar documents, since you can't compare against each document in your database. There is SimHash from Google, but I could not find any implementation to use, so I created my own. My hash counts low- and high-frequency characters inside the HTML code and generates a 20-byte hash, which is compared with a small cache of the last crawled pages inside an AVL tree with a nearest-neighbors search with some tolerance (about 2). You can't use any reference to character locations in this hash. After "recognizing" the trap, you can record the URL pattern of the duplicate content and start to ignore pages with that too.
• Like Google, you can create a ranking for each website and "trust" one more than others.
A web crawler is a computer program used to collect/crawl key values (HREF links, image links, meta data, etc.) from a given website URL. It is designed to follow HREF links that were already fetched from the previous URL, so in this way the crawler can jump from one website to other websites. It is usually called a web spider or web bot. This mechanism always acts as the backbone of a web search engine.
Please find the source code on my tech blog - http://www.algonuts.info/how-to-built-a-simple-web-crawler-in-php.html
<?php
class webCrawler
{
public $siteURL;
public $error;
function __construct()
{
$this->siteURL = "";
$this->error = "";
}
function parser()
{
global $hrefTag,$hrefTagCountStart,$hrefTagCountFinal,$hrefTagLengthStart,$hrefTagLengthFinal,$hrefTagPointer;
global $imgTag,$imgTagCountStart,$imgTagCountFinal,$imgTagLengthStart,$imgTagLengthFinal,$imgTagPointer;
global $Url_Extensions,$Document_Extensions,$Image_Extensions,$crawlOptions;
$dotCount = 0;
$slashCount = 0;
$singleSlashCount = 0;
$doubleSlashCount = 0;
$parentDirectoryCount = 0;
$linkBuffer = array();
if(($url = trim($this->siteURL)) != "")
{
$crawlURL = rtrim($url,"/");
if(($directoryURL = dirname($crawlURL)) == "http:")
{ $directoryURL = $crawlURL; }
$urlParser = preg_split("/\//",$crawlURL);
//-- Curl Start --
$curlObject = curl_init($crawlURL);
curl_setopt_array($curlObject,$crawlOptions);
$webPageContent = curl_exec($curlObject);
$errorNumber = curl_errno($curlObject);
curl_close($curlObject);
//-- Curl End --
if($errorNumber == 0)
{
$webPageCounter = 0;
$webPageLength = strlen($webPageContent);
while($webPageCounter < $webPageLength)
{
$character = $webPageContent[$webPageCounter];
if($character == "")
{
$webPageCounter++;
continue;
}
$character = strtolower($character);
//-- Href Filter Start --
if($hrefTagPointer[$hrefTagLengthStart] == $character)
{
$hrefTagLengthStart++;
if($hrefTagLengthStart == $hrefTagLengthFinal)
{
$hrefTagCountStart++;
if($hrefTagCountStart == $hrefTagCountFinal)
{
if($hrefURL != "")
{
if($parentDirectoryCount >= 1 || $singleSlashCount >= 1 || $doubleSlashCount >= 1)
{
if($doubleSlashCount >= 1)
{ $hrefURL = "http://".$hrefURL; }
else if($parentDirectoryCount >= 1)
{
$tempData = 0;
$tempString = "";
$tempTotal = count($urlParser) - $parentDirectoryCount;
while($tempData < $tempTotal)
{
$tempString .= $urlParser[$tempData]."/";
$tempData++;
}
$hrefURL = $tempString."".$hrefURL;
}
else if($singleSlashCount >= 1)
{ $hrefURL = $urlParser[0]."/".$urlParser[1]."/".$urlParser[2]."/".$hrefURL; }
}
$host = "";
$hrefURL = urldecode($hrefURL);
$hrefURL = rtrim($hrefURL,"/");
if(filter_var($hrefURL,FILTER_VALIDATE_URL) == true)
{
$dump = parse_url($hrefURL);
if(isset($dump["host"]))
{ $host = trim(strtolower($dump["host"])); }
}
else
{
$hrefURL = $directoryURL."/".$hrefURL;
if(filter_var($hrefURL,FILTER_VALIDATE_URL) == true)
{
$dump = parse_url($hrefURL);
if(isset($dump["host"]))
{ $host = trim(strtolower($dump["host"])); }
}
}
if($host != "")
{
$extension = pathinfo($hrefURL,PATHINFO_EXTENSION);
if($extension != "")
{
$tempBuffer ="";
$extensionlength = strlen($extension);
for($tempData = 0; $tempData < $extensionlength; $tempData++)
{
if($extension[$tempData] != "?")
{
$tempBuffer = $tempBuffer.$extension[$tempData];
continue;
}
else
{
$extension = trim($tempBuffer);
break;
}
}
if(in_array($extension,$Url_Extensions))
{ $type = "domain"; }
else if(in_array($extension,$Image_Extensions))
{ $type = "image"; }
else if(in_array($extension,$Document_Extensions))
{ $type = "document"; }
else
{ $type = "unknown"; }
}
else
{ $type = "domain"; }
if($hrefURL != "")
{
if($type == "domain" && !in_array($hrefURL,$this->linkBuffer["domain"]))
{ $this->linkBuffer["domain"][] = $hrefURL; }
if($type == "image" && !in_array($hrefURL,$this->linkBuffer["image"]))
{ $this->linkBuffer["image"][] = $hrefURL; }
if($type == "document" && !in_array($hrefURL,$this->linkBuffer["document"]))
{ $this->linkBuffer["document"][] = $hrefURL; }
if($type == "unknown" && !in_array($hrefURL,$this->linkBuffer["unknown"]))
{ $this->linkBuffer["unknown"][] = $hrefURL; }
}
}
}
$hrefTagCountStart = 0;
}
if($hrefTagCountStart == 3)
{
$hrefURL = "";
$dotCount = 0;
$slashCount = 0;
$singleSlashCount = 0;
$doubleSlashCount = 0;
$parentDirectoryCount = 0;
$webPageCounter++;
while($webPageCounter < $webPageLength)
{
$character = $webPageContent[$webPageCounter];
if($character == "")
{
$webPageCounter++;
continue;
}
if($character == "\"" || $character == "'")
{
$webPageCounter++;
while($webPageCounter < $webPageLength)
{
$character = $webPageContent[$webPageCounter];
if($character == "")
{
$webPageCounter++;
continue;
}
if($character == "\"" || $character == "'" || $character == "#")
{
$webPageCounter--;
break;
}
else if($hrefURL != "")
{ $hrefURL .= $character; }
else if($character == "." || $character == "/")
{
if($character == ".")
{
$dotCount++;
$slashCount = 0;
}
else if($character == "/")
{
$slashCount++;
if($dotCount == 2 && $slashCount == 1)
$parentDirectoryCount++;
else if($dotCount == 0 && $slashCount == 1)
$singleSlashCount++;
else if($dotCount == 0 && $slashCount == 2)
$doubleSlashCount++;
$dotCount = 0;
}
}
else
{ $hrefURL .= $character; }
$webPageCounter++;
}
break;
}
$webPageCounter++;
}
}
$hrefTagLengthStart = 0;
$hrefTagLengthFinal = strlen($hrefTag[$hrefTagCountStart]);
$hrefTagPointer =& $hrefTag[$hrefTagCountStart];
}
}
else
{ $hrefTagLengthStart = 0; }
//-- Href Filter End --
//-- Image Filter Start --
if($imgTagPointer[$imgTagLengthStart] == $character)
{
$imgTagLengthStart++;
if($imgTagLengthStart == $imgTagLengthFinal)
{
$imgTagCountStart++;
if($imgTagCountStart == $imgTagCountFinal)
{
if($imgURL != "")
{
if($parentDirectoryCount >= 1 || $singleSlashCount >= 1 || $doubleSlashCount >= 1)
{
if($doubleSlashCount >= 1)
{ $imgURL = "http://".$imgURL; }
else if($parentDirectoryCount >= 1)
{
$tempData = 0;
$tempString = "";
$tempTotal = count($urlParser) - $parentDirectoryCount;
while($tempData < $tempTotal)
{
$tempString .= $urlParser[$tempData]."/";
$tempData++;
}
$imgURL = $tempString."".$imgURL;
}
else if($singleSlashCount >= 1)
{ $imgURL = $urlParser[0]."/".$urlParser[1]."/".$urlParser[2]."/".$imgURL; }
}
$host = "";
$imgURL = urldecode($imgURL);
$imgURL = rtrim($imgURL,"/");
if(filter_var($imgURL,FILTER_VALIDATE_URL) == true)
{
$dump = parse_url($imgURL);
$host = trim(strtolower($dump["host"]));
}
else
{
$imgURL = $directoryURL."/".$imgURL;
if(filter_var($imgURL,FILTER_VALIDATE_URL) == true)
{
$dump = parse_url($imgURL);
$host = trim(strtolower($dump["host"]));
}
}
if($host != "")
{
$extension = pathinfo($imgURL,PATHINFO_EXTENSION);
if($extension != "")
{
$tempBuffer ="";
$extensionlength = strlen($extension);
for($tempData = 0; $tempData < $extensionlength; $tempData++)
{
if($extension[$tempData] != "?")
{
$tempBuffer = $tempBuffer.$extension[$tempData];
continue;
}
else
{
$extension = trim($tempBuffer);
break;
}
}
if(in_array($extension,$Url_Extensions))
{ $type = "domain"; }
else if(in_array($extension,$Image_Extensions))
{ $type = "image"; }
else if(in_array($extension,$Document_Extensions))
{ $type = "document"; }
else
{ $type = "unknown"; }
}
else
{ $type = "domain"; }
if($imgURL != "")
{
if($type == "domain" && !in_array($imgURL,$this->linkBuffer["domain"]))
{ $this->linkBuffer["domain"][] = $imgURL; }
if($type == "image" && !in_array($imgURL,$this->linkBuffer["image"]))
{ $this->linkBuffer["image"][] = $imgURL; }
if($type == "document" && !in_array($imgURL,$this->linkBuffer["document"]))
{ $this->linkBuffer["document"][] = $imgURL; }
if($type == "unknown" && !in_array($imgURL,$this->linkBuffer["unknown"]))
{ $this->linkBuffer["unknown"][] = $imgURL; }
}
}
}
$imgTagCountStart = 0;
}
if($imgTagCountStart == 3)
{
$imgURL = "";
$dotCount = 0;
$slashCount = 0;
$singleSlashCount = 0;
$doubleSlashCount = 0;
$parentDirectoryCount = 0;
$webPageCounter++;
while($webPageCounter < $webPageLength)
{
$character = $webPageContent[$webPageCounter];
if($character == "")
{
$webPageCounter++;
continue;
}
if($character == "\"" || $character == "'")
{
$webPageCounter++;
while($webPageCounter < $webPageLength)
{
$character = $webPageContent[$webPageCounter];
if($character == "")
{
$webPageCounter++;
continue;
}
if($character == "\"" || $character == "'" || $character == "#")
{
$webPageCounter--;
break;
}
else if($imgURL != "")
{ $imgURL .= $character; }
else if($character == "." || $character == "/")
{
if($character == ".")
{
$dotCount++;
$slashCount = 0;
}
else if($character == "/")
{
$slashCount++;
if($dotCount == 2 && $slashCount == 1)
$parentDirectoryCount++;
else if($dotCount == 0 && $slashCount == 1)
$singleSlashCount++;
else if($dotCount == 0 && $slashCount == 2)
$doubleSlashCount++;
$dotCount = 0;
}
}
else
{ $imgURL .= $character; }
$webPageCounter++;
}
break;
}
$webPageCounter++;
}
}
$imgTagLengthStart = 0;
$imgTagLengthFinal = strlen($imgTag[$imgTagCountStart]);
$imgTagPointer =& $imgTag[$imgTagCountStart];
}
}
else
{ $imgTagLengthStart = 0; }
//-- Image Filter End --
$webPageCounter++;
}
}
else
{ $this->error = "Unable to proceed, permission denied"; }
}
else
{ $this->error = "Please enter url"; }
if($this->error != "")
{ $this->linkBuffer["error"] = $this->error; }
return $this->linkBuffer;
}
}
?>
Well the web is basically a directed graph, so you can construct a graph out of the urls and then do a BFS or DFS traversal while marking the visited nodes so you don't visit the same page twice.
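A minimal BFS-style sketch of that idea in PHP: a queue of URLs, a visited set so no page is crawled twice, and a depth cap as a crude guard against traps. The link extraction here is a naive regex placeholder; a real crawler would use a proper HTML parser:
<?php
// Very naive link extraction (illustrative only).
function extractLinks($html, $baseUrl) {
    preg_match_all('/href=["\']([^"\'#]+)["\']/i', $html, $m);
    $links = array();
    foreach ($m[1] as $href) {
        if (strpos($href, 'http') === 0) {
            $links[] = $href;                                           // absolute URL
        } else {
            $links[] = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');  // crude join
        }
    }
    return array_unique($links);
}

function crawl($seedUrl, $maxDepth = 3) {
    $queue   = array(array($seedUrl, 0));   // (url, depth) pairs
    $visited = array($seedUrl => true);
    while (!empty($queue)) {
        list($url, $depth) = array_shift($queue);
        $html = @file_get_contents($url);
        if ($html === false || $depth >= $maxDepth) {
            continue;
        }
        foreach (extractLinks($html, $url) as $link) {
            if (!isset($visited[$link])) {
                $visited[$link] = true;      // mark before enqueueing: never revisit
                $queue[] = array($link, $depth + 1);
            }
        }
    }
    return array_keys($visited);
}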
This is a web crawler example which can be used to collect MAC addresses for MAC spoofing.
#!/usr/bin/env python
import sys
import os
import urlparse
import urllib
from bs4 import BeautifulSoup

def mac_addr_str(f_data):
    global fptr
    global mac_list
    word_array = f_data.split(" ")
    for word in word_array:
        if len(word) == 17 and ':' in word[2] and ':' in word[5] and ':' in word[8] and ':' in word[11] and ':' in word[14]:
            if word not in mac_list:
                mac_list.append(word)
                fptr.writelines(word + "\n")
                print word

url = "http://stackoverflow.com/questions/tagged/mac-address"
url_list = [url]
visited = [url]
pwd = os.getcwd()
pwd = pwd + "/internet_mac.txt"
fptr = open(pwd, "a")
mac_list = []

while len(url_list) > 0:
    try:
        htmltext = urllib.urlopen(url_list[0]).read()
    except:
        url_list.pop(0)      # skip URLs that fail to load
        continue
    mac_addr_str(htmltext)
    soup = BeautifulSoup(htmltext)
    url_list.pop(0)
    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            url_list.append(tag['href'])
            visited.append(tag['href'])
Change the URL to crawl more sites... good luck.

Calculate when a cron job will be executed the next time

I have a cron "time definition"
1 * * * * (every hour at xx:01)
2 5 * * * (every day at 05:02)
0 4 3 * * (every third day of the month at 04:00)
* 2 * * 5 (every minute between 02:00 and 02:59 on fridays)
And I have a Unix timestamp.
Is there an obvious way to find (calculate) the next time (after that given timestamp) the job is due to be executed?
I'm using PHP, but the problem should be fairly language-agnostic.
[Update]
The class "PHP Cron Parser" (suggested by Ray) calculates the LAST time the CRON job was supposed to be executed, not the next time.
To make it easier: In my case the cron time parameters are only absolute, single numbers or "*". There are no time-ranges and no "*/5" intervals.
Here's a PHP project that is based on dlamblin's pseudocode.
It can calculate the next run date of a CRON expression, the previous run date of a CRON expression, and determine if a CRON expression matches a given time. This CRON expression parser fully implements CRON:
Increments of ranges (e.g. */12, 3-59/15)
Intervals (e.g. 1-4, MON-FRI, JAN-MAR )
Lists (e.g. 1,2,3 | JAN,MAR,DEC)
Last day of a month (e.g. L)
Last given weekday of a month (e.g. 5L)
Nth given weekday of a month (e.g. 3#2, 1#1, MON#4)
Closest weekday to a given day of the month (e.g. 15W, 1W, 30W)
https://github.com/mtdowling/cron-expression
Usage (PHP 5.3+):
<?php
// Works with predefined scheduling definitions
$cron = Cron\CronExpression::factory('@daily');
$cron->isDue();
$cron->getNextRunDate();
$cron->getPreviousRunDate();
// Works with complex expressions
$cron = Cron\CronExpression::factory('15 2,6-12 */15 1 2-5');
$cron->getNextRunDate();
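For example, to answer the original question with this library (a sketch, assuming the package is installed and autoloaded via Composer):
<?php
require 'vendor/autoload.php';

// "1 * * * *" is the first definition from the question (every hour at xx:01).
$cron = Cron\CronExpression::factory('1 * * * *');
echo $cron->getNextRunDate()->format('Y-m-d H:i:s') . "\n";

// The next run after an arbitrary Unix timestamp:
echo $cron->getNextRunDate(new DateTime('@1234567890'))->format('Y-m-d H:i:s') . "\n";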
This is basically doing the reverse of checking if the current time fits the conditions. So something like:
//Totally made up language
next = getTimeNow();
next.addMinutes(1) //so that next is never now
done = false;
while (!done) {
if (cron.minute != '*' && next.minute != cron.minute) {
if (next.minute > cron.minute) {
next.addHours(1);
}
next.minute = cron.minute;
}
if (cron.hour != '*' && next.hour != cron.hour) {
if (next.hour > cron.hour) {
next.hour = cron.hour;
next.addDays(1);
next.minute = 0;
continue;
}
next.hour = cron.hour;
next.minute = 0;
continue;
}
if (cron.weekday != '*' && next.weekday != cron.weekday) {
deltaDays = cron.weekday - next.weekday //assume weekday is 0=sun, 1 ... 6=sat
if (deltaDays < 0) { deltaDays+=7; }
next.addDays(deltaDays);
next.hour = 0;
next.minute = 0;
continue;
}
if (cron.day != '*' && next.day != cron.day) {
if (next.day > cron.day || !next.month.hasDay(cron.day)) {
next.addMonths(1);
next.day = 1; //assume days 1..31
next.hour = 0;
next.minute = 0;
continue;
}
next.day = cron.day
next.hour = 0;
next.minute = 0;
continue;
}
if (cron.month != '*' && next.month != cron.month) {
if (next.month > cron.month) {
next.addMonths(12-next.month+cron.month)
next.day = 1; //assume days 1..31
next.hour = 0;
next.minute = 0;
continue;
}
next.month = cron.month;
next.day = 1;
next.hour = 0;
next.minute = 0;
continue;
}
done = true;
}
I might have written that a bit backwards. It can also be a lot shorter if, in every main if, instead of doing the greater-than check you merely increment the current time grade by one and set the lesser time grades to 0 and then continue; however, then you'll be looping a lot more. Like so:
//Shorter more loopy version
next = getTimeNow().addMinutes(1);
while (true) {
if (cron.month != '*' && next.month != cron.month) {
next.addMonths(1);
next.day = 1;
next.hour = 0;
next.minute = 0;
continue;
}
if (cron.day != '*' && next.day != cron.day) {
next.addDays(1);
next.hour = 0;
next.minute = 0;
continue;
}
if (cron.weekday != '*' && next.weekday != cron.weekday) {
next.addDays(1);
next.hour = 0;
next.minute = 0;
continue;
}
if (cron.hour != '*' && next.hour != cron.hour) {
next.addHours(1);
next.minute = 0;
continue;
}
if (cron.minute != '*' && next.minute != cron.minute) {
next.addMinutes(1);
continue;
}
break;
}
For anyone interested, here's my final PHP implementation, which pretty much matches dlamblin's pseudocode:
class myMiniDate {
var $myTimestamp;
static private $dateComponent = array(
'second' => 's',
'minute' => 'i',
'hour' => 'G',
'day' => 'j',
'month' => 'n',
'year' => 'Y',
'dow' => 'w',
'timestamp' => 'U'
);
static private $weekday = array(
1 => 'monday',
2 => 'tuesday',
3 => 'wednesday',
4 => 'thursday',
5 => 'friday',
6 => 'saturday',
0 => 'sunday'
);
function __construct($ts = NULL) { $this->myTimestamp = is_null($ts)?time():$ts; }
function __set($var, $value) {
list($c['second'], $c['minute'], $c['hour'], $c['day'], $c['month'], $c['year'], $c['dow']) = explode(' ', date('s i G j n Y w', $this->myTimestamp));
switch ($var) {
case 'dow':
$this->myTimestamp = strtotime(self::$weekday[$value], $this->myTimestamp);
break;
case 'timestamp':
$this->myTimestamp = $value;
break;
default:
$c[$var] = $value;
$this->myTimestamp = mktime($c['hour'], $c['minute'], $c['second'], $c['month'], $c['day'], $c['year']);
}
}
function __get($var) {
return date(self::$dateComponent[$var], $this->myTimestamp);
}
function modify($how) { return $this->myTimestamp = strtotime($how, $this->myTimestamp); }
}
$cron = new myMiniDate(time() + 60);
$cron->second = 0;
$done = 0;
echo date('Y-m-d H:i:s') . '<hr>' . date('Y-m-d H:i:s', $cron->timestamp) . '<hr>';
$Job = array(
'Minute' => 5,
'Hour' => 3,
'Day' => 13,
'Month' => null,
'DOW' => 5,
);
while ($done < 100) {
if (!is_null($Job['Minute']) && ($cron->minute != $Job['Minute'])) {
if ($cron->minute > $Job['Minute']) {
$cron->modify('+1 hour');
}
$cron->minute = $Job['Minute'];
}
if (!is_null($Job['Hour']) && ($cron->hour != $Job['Hour'])) {
if ($cron->hour > $Job['Hour']) {
$cron->modify('+1 day');
}
$cron->hour = $Job['Hour'];
$cron->minute = 0;
}
if (!is_null($Job['DOW']) && ($cron->dow != $Job['DOW'])) {
$cron->dow = $Job['DOW'];
$cron->hour = 0;
$cron->minute = 0;
}
if (!is_null($Job['Day']) && ($cron->day != $Job['Day'])) {
if ($cron->day > $Job['Day']) {
$cron->modify('+1 month');
}
$cron->day = $Job['Day'];
$cron->hour = 0;
$cron->minute = 0;
}
if (!is_null($Job['Month']) && ($cron->month != $Job['Month'])) {
if ($cron->month > $Job['Month']) {
$cron->modify('+1 year');
}
$cron->month = $Job['Month'];
$cron->day = 1;
$cron->hour = 0;
$cron->minute = 0;
}
$done = (is_null($Job['Minute']) || $Job['Minute'] == $cron->minute) &&
(is_null($Job['Hour']) || $Job['Hour'] == $cron->hour) &&
(is_null($Job['Day']) || $Job['Day'] == $cron->day) &&
(is_null($Job['Month']) || $Job['Month'] == $cron->month) &&
(is_null($Job['DOW']) || $Job['DOW'] == $cron->dow)?100:($done+1);
}
echo date('Y-m-d H:i:s', $cron->timestamp) . '<hr>';
Use this function:
function parse_crontab($time, $crontab)
{
    $time = explode(' ', date('i G j n w', strtotime($time)));
    $crontab = explode(' ', $crontab);
    foreach ($crontab as $k => &$v) {
        $v = explode(',', $v);
        foreach ($v as &$v1) {
            $v1 = preg_replace(
                array('/^\*$/', '/^\d+$/', '/^(\d+)\-(\d+)$/', '/^\*\/(\d+)$/'),
                array('true', '"'.$time[$k].'"==="\0"', '(\1<='.$time[$k].' and '.$time[$k].'<=\2)', $time[$k].'%\1===0'),
                $v1
            );
        }
        $v = '('.implode(' or ', $v).')';
    }
    $crontab = implode(' and ', $crontab);
    return eval('return '.$crontab.';');
}
var_export(parse_crontab('2011-05-04 02:08:03', '*/2,3-5,9 2 3-5 */2 *'));
var_export(parse_crontab('2011-05-04 02:08:03', '*/8 */2 */4 */5 *'));
Edit: Maybe this is more readable:
<?php
function parse_crontab($frequency='* * * * *', $time=false) {
    $time = is_string($time) ? strtotime($time) : time();
    $time = explode(' ', date('i G j n w', $time));
    $crontab = explode(' ', $frequency);
    foreach ($crontab as $k => &$v) {
        $v = explode(',', $v);
        $regexps = array(
            '/^\*$/',           # every
            '/^\d+$/',          # digit
            '/^(\d+)\-(\d+)$/', # range
            '/^\*\/(\d+)$/'     # every digit
        );
        $content = array(
            "true",                                     # every
            "{$time[$k]} === $0",                       # digit
            "($1 <= {$time[$k]} && {$time[$k]} <= $2)", # range
            "{$time[$k]} % $1 === 0"                    # every digit
        );
        foreach ($v as &$v1)
            $v1 = preg_replace($regexps, $content, $v1);
        $v = '('.implode(' || ', $v).')';
    }
    $crontab = implode(' && ', $crontab);
    return eval("return {$crontab};");
}
Usage:
<?php
if (parse_crontab('*/5 2 * * *')) {
// should run cron
} else {
// should not run cron
}
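Since parse_crontab() only tells you whether a given minute matches, one simple (brute-force) way to answer the original question of the next execution time after a given timestamp is to step forward minute by minute until it returns true. A sketch using the second (readable) parse_crontab() above; the smarter jump-ahead search in the pseudocode answers avoids the brute force:
<?php
function next_run($frequency, $timestamp) {
    $t = $timestamp - ($timestamp % 60) + 60;     // start at the next whole minute
    for ($i = 0; $i < 60 * 24 * 366; $i++) {      // give up after roughly a year
        if (parse_crontab($frequency, date('Y-m-d H:i:s', $t))) {
            return $t;
        }
        $t += 60;
    }
    return false;
}

echo date('Y-m-d H:i:s', next_run('1 * * * *', time()));   // next xx:01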
I created a JavaScript API for calculating the next run time, based on @dlamblin's idea. It supports seconds and years. I have not managed to test it fully yet, so expect bugs, but let me know if you find any.
Repository link: https://bitbucket.org/nevity/cronner
Check this out:
It can calculate the next time a scheduled job is supposed to run, based on the given cron definitions.
Thanks for posting this code. It definitely helped me out, even 6 years later.
While trying to implement it, I found a small bug:
date('i G j n w', $time) returns a zero-padded integer for the minutes.
Later in the code, it does a modulus on that zero-padded integer, and PHP doesn't seem to handle this as expected.
$ php
<?php
print 8 % 5 . "\n";
print 08 % 5 . "\n";
?>
3
0
As you can see, 08 % 5 returns 0, whereas 8 % 5 returns the expected 3. I couldn't find a non-padded option for the date command. I tried fiddling with the {$time[$k]} % $1 === 0 line (like changing {$time[$k]} to ({$time[$k]}+0)), but couldn't get it to drop the 0 padding during the modulus.
So, I ended up just changing the original value returned by the date function and removing the 0 padding by running $time[0] = $time[0] + 0;.
Here is my test.
<?php
function parse_crontab($frequency='* * * * *', $time=false) {
$time = is_string($time) ? strtotime($time) : time();
$time = explode(' ', date('i G j n w', $time));
$time[0] = $time[0] + 0;
$crontab = explode(' ', $frequency);
foreach ($crontab as $k => &$v) {
$v = explode(',', $v);
$regexps = array(
'/^\*$/', # every
'/^\d+$/', # digit
'/^(\d+)\-(\d+)$/', # range
'/^\*\/(\d+)$/' # every digit
);
$content = array(
"true", # every
"{$time[$k]} === $0", # digit
"($1 <= {$time[$k]} && {$time[$k]} <= $2)", # range
"{$time[$k]} % $1 === 0" # every digit
);
foreach ($v as &$v1)
$v1 = preg_replace($regexps, $content, $v1);
$v = '('.implode(' || ', $v).')';
}
$crontab = implode(' && ', $crontab);
return eval("return {$crontab};");
}
for($i=0; $i<24; $i++) {
for($j=0; $j<60; $j++) {
$date=sprintf("%d:%02d",$i,$j);
if (parse_crontab('*/5 * * * *',$date)) {
print "$date yes\n";
} else {
print "$date no\n";
}
}
}
?>
My answer is not unique, just a replica of @BlaM's answer written in Java, because PHP's date and time handling is a bit different from Java's.
This program assumes that the CRON expression is simple. It can only contain digits or *.
Minute = 0-59
Hour = 0-23
Day = 1-31
MONTH = 1-12 where 1 = January.
WEEKDAY = 1-7 where 1 = Sunday.
Code:
package main;
import java.util.Calendar;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CronPredict
{
public static void main(String[] args)
{
String cronExpression = "5 3 27 3 3 ls -la > a.txt";
CronPredict cronPredict = new CronPredict();
String[] parsed = cronPredict.parseCronExpression(cronExpression);
System.out.println(cronPredict.getNextExecution(parsed).getTime().toString());
}
//This method takes a cron string and separates entities like minutes, hours, etc.
public String[] parseCronExpression(String cronExpression)
{
String[] parsedExpression = null;
String cronPattern = "^([0-9]|[1-5][0-9]|\\*)\\s([0-9]|1[0-9]|2[0-3]|\\*)\\s"
+ "([1-9]|[1-2][0-9]|3[0-1]|\\*)\\s([1-9]|1[0-2]|\\*)\\s"
+ "([1-7]|\\*)\\s(.*)$";
Pattern cronRegex = Pattern.compile(cronPattern);
Matcher matcher = cronRegex.matcher(cronExpression);
if(matcher.matches())
{
String minute = matcher.group(1);
String hour = matcher.group(2);
String day = matcher.group(3);
String month = matcher.group(4);
String weekday = matcher.group(5);
String command = matcher.group(6);
parsedExpression = new String[6];
parsedExpression[0] = minute;
parsedExpression[1] = hour;
parsedExpression[2] = day;
//since java's month start's from 0 as opposed to PHP which starts from 1.
parsedExpression[3] = month.equals("*") ? month : (Integer.parseInt(month) - 1) + "";
parsedExpression[4] = weekday;
parsedExpression[5] = command;
}
return parsedExpression;
}
public Calendar getNextExecution(String[] job)
{
Calendar cron = Calendar.getInstance();
cron.add(Calendar.MINUTE, 1);
cron.set(Calendar.MILLISECOND, 0);
cron.set(Calendar.SECOND, 0);
int done = 0;
//Loop because some dates are not valid.
//e.g. March 29 which is a Friday may never come for atleast next 1000 years.
//We do not want to keep looping. Also it protects against invalid dates such as feb 30.
while(done < 100)
{
if(!job[0].equals("*") && cron.get(Calendar.MINUTE) != Integer.parseInt(job[0]))
{
if(cron.get(Calendar.MINUTE) > Integer.parseInt(job[0]))
{
cron.add(Calendar.HOUR_OF_DAY, 1);
}
cron.set(Calendar.MINUTE, Integer.parseInt(job[0]));
}
if(!job[1].equals("*") && cron.get(Calendar.HOUR_OF_DAY) != Integer.parseInt(job[1]))
{
if(cron.get(Calendar.HOUR_OF_DAY) > Integer.parseInt(job[1]))
{
cron.add(Calendar.DAY_OF_MONTH, 1);
}
cron.set(Calendar.HOUR_OF_DAY, Integer.parseInt(job[1]));
cron.set(Calendar.MINUTE, 0);
}
if(!job[4].equals("*") && cron.get(Calendar.DAY_OF_WEEK) != Integer.parseInt(job[4]))
{
Date previousDate = cron.getTime();
cron.set(Calendar.DAY_OF_WEEK, Integer.parseInt(job[4]));
Date newDate = cron.getTime();
if(newDate.before(previousDate))
{
cron.add(Calendar.WEEK_OF_MONTH, 1);
}
cron.set(Calendar.HOUR_OF_DAY, 0);
cron.set(Calendar.MINUTE, 0);
}
if(!job[2].equals("*") && cron.get(Calendar.DAY_OF_MONTH) != Integer.parseInt(job[2]))
{
if(cron.get(Calendar.DAY_OF_MONTH) > Integer.parseInt(job[2]))
{
cron.add(Calendar.MONTH, 1);
}
cron.set(Calendar.DAY_OF_MONTH, Integer.parseInt(job[2]));
cron.set(Calendar.HOUR_OF_DAY, 0);
cron.set(Calendar.MINUTE, 0);
}
if(!job[3].equals("*") && cron.get(Calendar.MONTH) != Integer.parseInt(job[3]))
{
if(cron.get(Calendar.MONTH) > Integer.parseInt(job[3]))
{
cron.add(Calendar.YEAR, 1);
}
cron.set(Calendar.MONTH, Integer.parseInt(job[3]));
cron.set(Calendar.DAY_OF_MONTH, 1);
cron.set(Calendar.HOUR_OF_DAY, 0);
cron.set(Calendar.MINUTE, 0);
}
done = (job[0].equals("*") || cron.get(Calendar.MINUTE) == Integer.parseInt(job[0])) &&
(job[1].equals("*") || cron.get(Calendar.HOUR_OF_DAY) == Integer.parseInt(job[1])) &&
(job[2].equals("*") || cron.get(Calendar.DAY_OF_MONTH) == Integer.parseInt(job[2])) &&
(job[3].equals("*") || cron.get(Calendar.MONTH) == Integer.parseInt(job[3])) &&
(job[4].equals("*") || cron.get(Calendar.DAY_OF_WEEK) == Integer.parseInt(job[4])) ? 100 : (done + 1);
}
return cron;
}
}
