I’ve been shying away from web-scraping projects because of the amount of anti-scraping tech out there lately. But today I found a new package that makes web scraping a whole lot easier. I’m not sure how much anti-scraping software it can get around, but it worked for a recent project, and I’m quite pleased with it.
The package is called “Goutte” (pronounced ‘goot’, i.e. it rhymes with boot and not out), and it’s a PHP library.
Here’s some example code. Let’s say you’re trying to get the prices from all elements on a page that have a CSS class of ‘dollarAmountHere’:
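// Assumes a Composer install of Goutte
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$url = 'https://somethingoranother.com/location';

// Fetch and parse the page
$crawler = $client->request('GET', $url);

// Collect the text of each matching span
$nodeValues = $crawler->filter('span.dollarAmountHere')->each(function ($node) {
    return $node->text();
});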
$nodeValues now contains an array of the contents of the spans that matched the dollarAmountHere class. Very handy!
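If you want the prices as numbers rather than strings, a rough sketch like the one below might do. Note that parsePrices is my own hypothetical helper, not part of Goutte, and it assumes US-style amounts such as “$1,299.99”. The sample array just stands in for whatever the scrape actually returns.

```php
<?php
// Hypothetical helper (not part of Goutte): turn scraped price text into floats.
// Assumes US-style formatting such as "$1,299.99".
function parsePrices(array $nodeValues): array
{
    $prices = [];
    foreach ($nodeValues as $text) {
        // Drop everything except digits, the decimal point, and a minus sign.
        $cleaned = preg_replace('/[^0-9.\-]/', '', (string) $text);
        // Skip entries that held no usable number (e.g. "Call for price").
        if ($cleaned !== '' && is_numeric($cleaned)) {
            $prices[] = (float) $cleaned;
        }
    }
    return $prices;
}

// Sample input standing in for real scrape results.
$nodeValues = ['$1,299.99', ' $45.00 ', 'Call for price', '$7'];
$prices = parsePrices($nodeValues);
```

Entries without a usable number are skipped rather than turned into zero, so a missing price doesn’t quietly skew a total or an average.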
To get a list of links, try this:
$nodeValues = $crawler->filter('a.linkClassName')->each(function ($node) {
    return $node->attr('href');
});
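Scraped link lists tend to be messy: duplicates, in-page anchors, and the odd missing href. Here’s a minimal sketch of one way to tidy them up. cleanLinks is a hypothetical helper of mine, not something Goutte provides, and the sample array is made up for illustration. It doesn’t resolve relative URLs; it only filters and de-duplicates.

```php
<?php
// Hypothetical helper (not part of Goutte): tidy a list of scraped href values.
function cleanLinks(array $hrefs): array
{
    $links = [];
    foreach ($hrefs as $href) {
        $href = trim((string) $href);
        // Skip missing hrefs, in-page anchors, and non-navigational schemes.
        if ($href === '' || $href[0] === '#' || preg_match('/^(javascript|mailto|tel):/i', $href)) {
            continue;
        }
        $links[] = $href;
    }
    // array_unique keeps the first occurrence; array_values reindexes from 0.
    return array_values(array_unique($links));
}

// Sample input standing in for real scrape results.
$nodeValues = ['/products/1', '#top', null, 'javascript:void(0)', '/products/1', 'https://example.com/about'];
$links = cleanLinks($nodeValues);
```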