Posts tagged ‘scraping’

Here’s some code that scrapes the links (URLs) from a webpage using PHP cURL, PHP DOMDocument and PHP DOMXPath.

<?php

$target = "http://joemarie-aliling.com/category/php-programming/";

// Fetch the page with cURL.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the page as a string instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);   // follow redirects
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);      // set the Referer header when following redirects
curl_setopt($ch, CURLOPT_USERAGENT, "Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");   // pretend that we are a browser
$page = curl_exec($ch);
if ($page === false) {
    die("cURL error: " . curl_error($ch));     // stop here if the download failed
}

// Parse the HTML and select every anchor tag inside the html and body tags.
$dom = new DOMDocument();
@$dom->loadHTML($page);   // @ suppresses the warnings that malformed HTML triggers
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

// Scheme and host of the target, used to complete root-relative URLs.
$parts = parse_url($target);
$origin = $parts['scheme'] . "://" . $parts['host'];

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    // Correct root-relative URLs and page anchors. Try removing these two lines and see the effect.
    if ($url !== "" && $url[0] == "/") $url = $origin . $url;
    elseif ($url !== "" && $url[0] == "#") $url = $target . $url;
    echo $url . "<br>";
}

?>
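
The correction step above only covers hrefs that start with "/" or "#". As a rough sketch of how the other common cases could be handled, here is a small helper. The function name resolve_url and the example.com addresses are made up for illustration and are not part of the original script. It assumes PHP 7 or later and a base URL that has a scheme and a host, and it does not collapse "../" segments.

```php
<?php
// Hypothetical helper (not from the original script): turn $href into an
// absolute URL, using $base as the page the link was found on.
function resolve_url(string $base, string $href): string
{
    // Already absolute (starts with a scheme such as http: or mailto:).
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }
    // An empty href points back at the page itself.
    if ($href === '') {
        return $base;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';
    $query  = isset($parts['query']) ? '?' . $parts['query'] : '';

    if (strpos($href, '//') === 0) {   // protocol-relative: reuse the base scheme
        return $parts['scheme'] . ':' . $href;
    }
    if ($href[0] === '/') {            // root-relative: keep only scheme and host
        return $origin . $href;
    }
    if ($href[0] === '#') {            // page anchor: keep path and query
        return $origin . $path . $query . $href;
    }
    if ($href[0] === '?') {            // query only: keep the path
        return $origin . $path . $href;
    }
    // Path-relative: replace whatever follows the last slash of the base path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolve_url('http://example.com/a/b/', '/x') . "\n";      // http://example.com/x
echo resolve_url('http://example.com/a/b/', '#top') . "\n";    // http://example.com/a/b/#top
echo resolve_url('http://example.com/a/b/', 'c.html') . "\n";  // http://example.com/a/b/c.html
```

Inside the loop, calling something like resolve_url($target, $url) could stand in for the "/" and "#" checks.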

There you have it! Hope this helps with your screen scraping projects.

-JM