- Screen Scraping: a.k.a. data scraping or website scraping
- A method of data extraction from web pages usually employing a programmatic web crawler/web spider
If you’re anything like me (and I hope you’re not!) there will come a time when you say to yourself “if only I could get the data from somedomain.com I could [insert your own dastardly plan here]”.
Over the years I’ve dabbled with a few different web scraping tools, mostly Perl and the WWW::Mechanize CPAN module. However I wanted to find out how to do this with PHP.
I also realized that a jQuery-like syntax, with its intuitive CSS-style selectors, would be ideal for screen scraping. Thankfully I’m smart enough to know that there are people out there who are far cleverer than I am and must have thought of this before me. A bit of googling later I had found QueryPath, a PHP library for working with XML/HTML documents that has been modeled on jQuery. Ideal!
So let’s get started with some code. My contrived screen scraping example is as follows:
1. Perform a Google search for my business website ‘mindtripz.com’
2. Perform data extraction on the HTML page returned by Google (the title and url of each search result)
3. Find the link to the next page of search results and follow it
4. Back to 2. Repeat until we reach our specified page limit
5. Output our data
6. Execute dastardly plan *
First we need to load the QueryPath library (you can find it here) and set up our default parameters.
- $url is obviously the initial Google search to perform.
- $baseUrl is prepended to the incomplete (relative) urls we will run into later.
- $pages is our page limit i.e. stop when we have scraped this number of pages.
- $results….well that’s where we will store our results.
<?php
// Load the QueryPath Library
include(dirname(__FILE__) . '/library/QueryPath/QueryPath.php');
// Our URL to start our web scrape from
$url = 'http://www.google.com/search?q=mindtripz.com';
// Url base to concatenate with incomplete links
// I could find this with a preg_match on $url but I'm too lazy
$baseUrl = 'http://www.google.com';
// No. of pages to crawl
$pages = 10;
// Our results go here
$results = array();
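For anyone less lazy than me, here is a minimal sketch of deriving $baseUrl from $url instead of hard-coding it. It uses PHP's built-in parse_url() rather than the preg_match mentioned in the comment, and baseUrlOf() is a hypothetical helper name of my own choosing, not something QueryPath provides.

```php
<?php
// Hypothetical helper (not part of the original script): build the
// scheme-and-host prefix of an absolute URL with PHP's built-in parse_url().
function baseUrlOf($url)
{
    $parts = parse_url($url);
    // Relative or malformed URLs have no scheme/host, so report that with null
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    return $parts['scheme'] . '://' . $parts['host'];
}

$url = 'http://www.google.com/search?q=mindtripz.com';
$baseUrl = baseUrlOf($url);

echo $baseUrl, "\n"; // prints http://www.google.com
```

The null return for relative URLs is just one possible design; it makes it obvious when the helper has been handed a link that still needs a prefix.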
Next we need to create a QueryPath object, which can be done with either the qp() or htmlqp() convenience functions. htmlqp() has been written with some behind-the-scenes magic that makes it the preferred option when working with HTML (as opposed to XHTML/XML). htmlqp() can take a file, a string or a URL as an argument. We will be using our $url variable and QueryPath will go and get the webpage for us… could it be any easier?
$qp = htmlqp($url);
So we have a QueryPath object which contains the HTML of our target page, now what? Well, we need to examine the source of the webpage we are scraping and find some identifiers, i.e. tags, IDs and classes, that we can use to target specific sections of the page.
You can see from the previous image that Google Chrome’s web inspector has identified for me that each search result is a list item and has classes of ‘g’, ‘w0’ and ‘knavi’. I first thought that using the full CSS selector, i.e. ‘li.g.w0.knavi’, would be a smart thing to do. It turns out that only the ‘g’ class is in the HTML and that Google adds the others with JavaScript, which (as expected) QueryPath doesn’t handle… remember this, it will save you from premature baldness!
So we can use our new selector ‘li.g’ to target all the search results on the page just like we would with jQuery.
$qp->find('li.g')
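To see why the shorter selector matters, here is a small sketch that does not use QueryPath at all. It assumes PHP's bundled DOM extension (DOMDocument and DOMXPath) and a made-up snippet of markup in which only the ‘g’ class is present, as it would be in the raw HTML. classToken() is a hypothetical helper that spells out in XPath what a CSS class selector means.

```php
<?php
// Made-up markup for illustration: only the 'g' class is in the raw HTML,
// the way the server sends it before any JavaScript has run.
$html = '<ol>'
      . '<li class="g"><h3><a class="l" href="/one">First</a></h3></li>'
      . '<li class="g"><h3><a class="l" href="/two">Second</a></h3></li>'
      . '<li class="ad"><h3><a href="/ad">Sponsored</a></h3></li>'
      . '</ol>';

// Hypothetical helper: XPath test for "has this class as a whole token"
function classToken($name)
{
    return 'contains(concat(" ", normalize-space(@class), " "), " ' . $name . ' ")';
}

libxml_use_internal_errors(true); // keep parser notices out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Equivalent of the CSS selector 'li.g'
$byG = $xpath->query('//li[' . classToken('g') . ']');

// Equivalent of 'li.g.w0.knavi': every class must be present in the markup
$byAll = $xpath->query(
    '//li[' . classToken('g') . ' and ' . classToken('w0') . ' and ' . classToken('knavi') . ']'
);

echo $byG->length, "\n";   // 2 list items carry the 'g' class
echo $byAll->length, "\n"; // 0, because 'w0' and 'knavi' never reached the HTML
```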
Still with me? OK. QueryPath uses something called method chaining, which means that we can structure our code exactly the same as we do with jQuery. Additionally, if we use PHP 5.3 or later we can use anonymous functions and closures to complete our imitation of jQuery.
So to expand our previous code snippet:
$qp->find('li.g')
    // Chain 'each' function with previous 'find'
    // Each will loop through the list of found elements
    // and execute our anonymous function on each iteration
    ->each(function($index,$item) use (&$callback){
        // Create a new but empty object
        $obj = new stdClass();
        // Get the text from the selected element
        $obj->name = trim( htmlqp($item)->find('h3 a.l')->text());
        // Get the href attribute value from the selected element
        $obj->link = trim( htmlqp($item)->find('h3 a.l')->attr('href'));
        // execute our callback
        $callback($obj);
    })
Now we have ‘chained’ the each() method to our previous find(). each() accepts our anonymous function as an argument and passes $index and $item to that function. $item is a PHP DOMNode object. OK, forget that (because it’s boring and over complicated). Feed $item back into a new QueryPath object and apply more selectors to zero in on the data to be scraped. For each iteration of the loop a new object is created to house our scraped data and is sent to our callback to be stored in the $results array. The image below shows where I found the selector ‘h3 a.l’ to retrieve the search result title and href attribute.
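If the use (&$results) part of the callback looks mysterious, this stand-alone sketch (plain PHP 5.3+, no QueryPath) may help. It compares a closure that captures $results by reference with one that captures it by value; $copyCallback is a name I made up purely for the comparison.

```php
<?php
$results = array();

// By-reference capture: the closure appends to the outer $results array
$callback = function ($resultObj) use (&$results) {
    $results[] = $resultObj;
};

// By-value capture, for comparison only: the closure appends to a private
// copy, so the outer $results array never sees these items
$copyCallback = function ($resultObj) use ($results) {
    $results[] = $resultObj;
};

foreach (array('First', 'Second') as $name) {
    $obj = new stdClass();
    $obj->name = $name;
    $callback($obj);
    $copyCallback($obj);
}

echo count($results), "\n"; // prints 2: only the by-reference closure stored anything
```

Without the ampersand the scraper would run happily and then hand you an empty $results array at the end.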
By now you are getting the idea, so let’s put it all together. The function scrape() encapsulates all our website scraping code and calls itself for each subsequent page. csv() formats our results so we can do something with them in a spreadsheet. Excessive in-code comments provided to guide the way.
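Before the full listing, the stop conditions of scrape() can be looked at in isolation. The sketch below is an assumption-laden stand-in: $fakeSite is a made-up map of "page" to "next link" so that nothing is fetched, and crawl() is a hypothetical loop-based equivalent of the recursion, not the code used in the real script.

```php
<?php
// Made-up link map standing in for real pages: each key is a URL and each
// value is the "next page" link found on it (null on the last page).
$fakeSite = array(
    '/search?q=x'          => '/search?q=x&start=10',
    '/search?q=x&start=10' => '/search?q=x&start=20',
    '/search?q=x&start=20' => null,
);

// Hypothetical crawl(): visits pages until there is no next link
// or the page limit is used up, the same two conditions scrape() checks
function crawl($url, $pageLimit, $findNext)
{
    $visited = array();
    while ($url !== null && $pageLimit > 0) {
        $visited[] = $url;        // "scrape" the current page
        $url = $findNext($url);   // look up the link to the next page
        $pageLimit--;             // one page of the budget is spent
    }
    return $visited;
}

// Stub for "find the next link on this page"
$findNext = function ($url) use ($fakeSite) {
    return isset($fakeSite[$url]) ? $fakeSite[$url] : null;
};

$visited = crawl('/search?q=x', 2, $findNext);
echo implode("\n", $visited), "\n"; // two URLs: the page limit stops the crawl first
```

With a limit of 10 the same call would visit all three fake pages and stop because the last one has no next link.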
<?php // Load the QueryPath Library
include(dirname(__FILE__) . '/library/QueryPath/QueryPath.php');
// Our URL to start our web scrape from
$url = 'http://www.google.com/search?q=mindtripz.com';
// Url base to concatenate with incomplete links
// I could find this with a preg_match on $url but I'm too lazy
$baseUrl = 'http://www.google.com';
// No. of pages to crawl
$pages = 10;
// Our results go here
$results = array();
// Create our screen scraping function
function scrape($url){
    // Allow selected globals in our function scope
    global $baseUrl,$pages,$results;
    // Prefix the url with the baseUrl if it doesn't already start with it
    // (a plain string check avoids treating the dots in $baseUrl as regex wildcards)
    if(strpos($url, $baseUrl) !== 0){
        $url = $baseUrl . $url;
    }
    // Visual progress indicator, print out the current url being scraped
    // Don't you just hate sitting there wondering if something is happening?
    echo $url . "\n";
    // Create a callback to store results, use closure to enable access to $results array
    $callback = function($resultsObj) use(&$results){
        $results[] = $resultsObj;
    };
    // Create our QueryPath object from our URL, QueryPath handily gets the page for us
    $qp = htmlqp($url);
    // Do jQuery-like goodness
    // find all the list elements with the class 'g'
    $next = $qp->find('li.g')
        // for each of the list elements execute our anonymous function
        ->each(function($index,$item) use (&$callback){
            // Create a new but empty object
            $obj = new stdClass();
            // Get the text from the selected element
            $obj->name = trim( htmlqp($item)->find('h3 a.l')->text());
            // Get the href attribute value from the selected element
            $obj->link = trim( htmlqp($item)->find('h3 a.l')->attr('href'));
            // execute our callback
            $callback($obj);
        })
        // Reset our 'cursor' to the top of the document
        ->top()
        // Search for the current page in the page navigation
        ->find('#navcnt td.cur')
        // Move to the next table cell
        ->next('td')
        // Find an anchor tag in the table cell
        ->find('a')
        // return the href to our $next variable at the top of the chain
        ->attr('href')
    ;
    // decrement pages
    $pages--;
    // Call scrape() recursively until either no url is present in $next
    // or the page limit is reached i.e. $pages == 0
    if($next && $pages > 0){
        scrape($next);
    }
}
// Output results as csv (because it's easy to copy paste into excel etc.)
function csv($results){
    // Loop thru our results array returning both the key and value
    foreach($results as $index => $result){
        echo $index + 1
            , ",'" . addslashes($result->name) ."'"
            , "," . $result->link
            , "\n"
        ;
    }
}
// Scrape it!
scrape($url);
// Print out the results in csv format
csv($results);
// The End
And we end up with this:
Easy eh? Well maybe it requires a little thought but QueryPath sure makes it a lot simpler than it could’ve been especially if you have a little jQuery under your belt.
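One caveat on csv(): addslashes() is not really CSV quoting, so a title containing commas or quotes may confuse a fussier spreadsheet import. A possible alternative, sketched here with PHP's built-in fputcsv(), is shown below. The csvSafe() name and the sample result are made up for illustration, and the explicit escape argument assumes a PHP version newer than the 5.3 used in this post.

```php
<?php
// Hypothetical stand-in for one scraped entry in the $results array
$sample = new stdClass();
$sample->name = 'Mindtripz, web development';
$sample->link = 'http://mindtripz.com/';
$results = array($sample);

// Hypothetical replacement for csv(): let fputcsv() handle the quoting
function csvSafe(array $results, $handle)
{
    foreach ($results as $index => $result) {
        // Delimiter, enclosure and escape are passed explicitly so the call
        // does not depend on version-specific defaults
        fputcsv($handle, array($index + 1, $result->name, $result->link), ',', '"', '\\');
    }
}

// Write to standard output, as the original csv() does with echo
$out = fopen('php://output', 'w');
csvSafe($results, $out);
fclose($out);
```

Because csvSafe() takes a stream handle, the same function can write straight to a file by passing the result of fopen('results.csv', 'w') instead.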
So that’s screen scraping with PHP. Check out QueryPath, it’s not limited to screen scraping, you can generate HTML with it too!
Phew, not a bad start for the 1st post on the re-boot of martinhurford.com. Let me know what you think.




10 Comments
Hi, I tried running your exact example, but line ’38′ keeps throwing an error…
Parse error: syntax error, unexpected ‘&’ in C:\….\test.php on line 38
Any reasons?
Cheers!
@Yemi, looks like I had a couple of errors with ‘>’ signs being converted to their html character entities ‘&gt;’, which is why you are getting the ampersand ‘&’ error. They should all be fixed now. Thanks for letting me know.
I’m getting: Parse error: syntax error, unexpected T_FUNCTION in /scraper/qptest.php on line 29
$callback = function($resultsObj) use(&$results){
$results[] = $resultsObj;
};
I don’t totally understand the syntax, so I’m sure that’s part of the problem.
@Dan
The syntax looks correct so I’m going to guess that you are not using PHP version 5.3 which is required as it supports anonymous functions and closures.
dont think this works, I got the result but its blank.
It’s possible that Google has changed the markup on their SERPs, which would mean you’d get no results. Check the selectors in the code and see if they match up to those on the latest Google search pages… they tinker with this stuff all the time.
Awesome article.
This is precisely what I was looking for: both a library like QueryPath and an example of Google scraping with PHP.
Nice.
Glad you liked it!
Hello, i’m trying ur code
but it doesnt seem to go past the 1st page.
any help will be appreciated
As this article was written some time ago it’s possible (and likely) that Google has changed the format of their search results pages. You will need to check that the selectors used in the code match those currently in use by Google. This is just a guide and you will need to tweak/debug the code based on your own use cases.