Here's some code that scrapes the links (URLs) from a web page using PHP cURL, DOMDocument and DOMXPath.
<?php
$target = "http://joemarie-aliling.com/category/php-programming/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"); // pretend that we are a browser
$page = curl_exec($ch);
if ($page === false) die("cURL error: " . curl_error($ch)); // stop here if the request failed
$dom = new DOMDocument();
@$dom->loadHTML($page); // @ suppresses the warnings that malformed HTML would otherwise trigger
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a"); // select every anchor (<a>) element nested anywhere inside <body>
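$parts = parse_url($target);
$origin = $parts['scheme'] . '://' . $parts['host']; // scheme and host of the target page
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    if ($url === '') continue; // skip anchors that have no href
    // The next three lines correct relative URLs and page anchors. Try removing them and see the effect.
    if (substr($url, 0, 2) == '//') $url = $parts['scheme'] . ':' . $url; // protocol-relative URL
    elseif ($url[0] == '/') $url = $origin . $url; // root-relative path: prepend scheme and host
    elseif ($url[0] == '#') $url = $target . $url; // page anchor: append to the page URL
    echo $url . "<br>";
}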
?>
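If you want to reuse the DOMDocument/DOMXPath part without the cURL request, it could be wrapped in a small function along these lines. This is only a sketch: the function name `extract_links` and the sample markup are made up for illustration and are not part of the script above. It swaps the `@` operator for `libxml_use_internal_errors()`, and uses the XPath expression `//a[@href]` so that anchors without an `href` attribute are skipped.

```php
<?php
// Hypothetical helper (not part of the original script): returns the href of
// every <a> element in an HTML string, in document order.
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse; avoids handing loadHTML() an empty string
    }

    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $xpath = new DOMXPath($dom);
    $links = [];
    // a[@href] matches only anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Usage with a small inline document (sample markup made up for illustration).
$sample = '<html><body><a href="/about">About</a><a name="top">Top</a><a href="#top">Up</a></body></html>';
print_r(extract_links($sample));
```

The function returns the raw attribute values, so relative URLs and page anchors still need the same correction as in the loop above before you can request them.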
There you have it! Hope this helps with your screen scraping projects.
-JM