Superb Web Scraping Tutorials by iAPDesign.com

Superb Web Scraping Tutorials using Laravel 4 by Developers.ph


Good day guys! Today I will show you how easy it is to scrape any website with just a little PHP, no more than 100 lines of code. :)

Web scraping is a well-known technique for extracting data from the web automatically. For me, the best example of web scraping is Google Search. Why? Because every time you publish a post to your blog, you will later find it in Google Search; that means Google indexes your blog post and crawls it again every time you update it. A web scraper is essentially a browsing automation system: software that acts like a human visiting pages, without any human interaction.

Superb Web Scraping Tutorials Using Laravel 4 By Developers.Ph

So today, I'll teach you how to build your own Superb Web Scraper in a couple of minutes.

Requirements:

  • Laravel 4 Framework
  • PHP 5.3 or higher
  • Composer (package required)
    • "fabpot/goutte": "v2.0.1" = Goutte, a simple PHP web scraper
    • Visit packagist.org and search for Goutte

STEP 1

Please download the Laravel 4 application skeleton and set it up on your computer; this tutorial is aimed at intermediate developers. If you're new to the Laravel framework, I'll make tutorials on how to set up a Laravel 4 application later, so just wait for that.

Go to Packagist and search for the fabpot/goutte package. It's a simple PHP web scraper wrapper built on Symfony components, specifically the DomCrawler (Symfony\Component\DomCrawler\Crawler).

Open up your composer.json file and add "fabpot/goutte": "v2.0.1" after the Laravel 4 framework entry inside the require object.

"require": {
"laravel/framework": "4.1.*",
"fabpot/goutte": "v2.0.1"
}

Open up your terminal, and run

composer update

Once the package is successfully installed in your application, open your terminal again
and run composer dump-autoload to refresh Composer's autoloaded classes.

composer dump-autoload

STEP 2

Now that we are done setting up the required packages for our web scraping application, let's move on to setting up our controller.

Create a new controller inside /app/controllers/ and name it "WebScraperController.php".

First, let's define our methods so we know the structure of our class. Keep in mind that it's much easier to stub out the methods first, with comments, before doing any real coding. Well, that's what I always do. Okay, so add this code to your new controller.

<?php

class WebScraperController extends BaseController {

	/**
	 * Defining our Dependency Injection Here.
	 * or Instantiate new Classes here.
	 *
	 * @return void
	 */
	public function __construct()
	{

	}

	/**
	 * This will be used for Outputting our Data
	 * and Rendering to browser.
	 *
	 * @return void
	 */
	public function getIndex()
	{

	}

	/**
	 * Setup our scraper data. Which includes the url that
	 * we want to scrape
	 *
	 * @param (String) $url = default is NULL
	 *		  (String) $method = Method Types its either POST || GET
	 * @return void
	 */
	public function setScrapeUrl($url = NULL, $method = 'GET')
	{

	}

	/**
	 * This will get all the return Result from our Web Scraper
	 *
	 * @return array
	 */
	public function getContents()
	{

	}

	/**
	 * It will handle all the scraping logic, filtering
	 * and getting the data from the defined url in our method setScrapeUrl()
	 *
	 * @return array
	 */
	private function startScraper()
	{

	}

}

Let me first explain all the methods I added to our new controller:

  • __construct() – This is used for instantiating the classes that we want to use. In this part I use a simple design pattern called dependency injection.
  • getIndex() – We use this to render the default view for our application, which uses our Blade templates and some CSS styling.
  • setScrapeUrl() – Sets the URL that we want to scrape.
  • getContents() – Used for getting all the returned results, passing them to our getIndex() method, and rendering them to the user.
  • startScraper() – Handles the web scraping logic and filtering, and returns an array of results that will be processed by getIndex() for rendering.

STEP 3

Now that we have set up our class and skeleton methods, we will use the package we downloaded a while ago, namely the Goutte\Client class. For more on the usage of this great tool, you can visit this link:
Goutte, a simple PHP Web Scraper

At the top of our class, pull in these two classes with the use keyword in PHP.

<?php
use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

class WebScraperController extends BaseController {
....

STEP 4

Now let's define our class properties, which will be used throughout the class, and initialize Goutte\Client inside our __construct() method.


use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

class WebScraperController extends BaseController {

	private $client;
	public  $url;
	public  $crawler;
	public  $filters;
	public  $content = array();

	public function __construct(Client $client)
	{
		$this->client = $client;
	}
...

STEP 5

Go to our getIndex() method, add our view, and set up the URL that will be scraped.

public function getIndex()
{
	$this->url = 'http://code.tutsplus.com';
	return View::make('scraper');
}

STEP 6

We already call View::make('scraper'), but we haven't created any views yet. Let's do that now. Go to your /views/ folder and create two files:

index.blade.php // This will be used as our master layout.
scraper.blade.php // This template renders our scraped data and extends the master layout.

STEP 7

For this step, I used Bower to fetch our front-end assets; you can check the bower.json file inside the code once you download it. It's self-explanatory, so you can understand it. For more information about Bower, just visit their website: Bower, a package manager.

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->
<head>
	<meta charset="utf-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
	<title>@yield('title', 'Superb Web Scraper Tutorial by iAPDesign.com')</title>

	<meta name="description" content="">
	<meta name="viewport" content="width=device-width">
	@yield('meta')

	@section('style')
		 <link rel="stylesheet" href="{{ URL::asset('assets/vendor/bootstrap/dist/css/bootstrap.min.css') }}">
		 <link rel="stylesheet" href="{{ URL::asset('assets/vendor/font-awesome/css/font-awesome.min.css') }}">
	@show
	@yield('stylesheets')
	<!-- jQuery -->
	<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
	<script>window.jQuery || document.write('<script src="{{ URL::asset("assets/vendor/jquery/jquery.min.js") }}"><\/script>')</script>
	@yield('script.header')

</head>
<body>

@yield('content')

@section('script.footer')
	 <!-- Script Footer -->
	 <script src="{{ URL::asset('assets/vendor/bootstrap/dist/js/bootstrap.min.js') }}"></script>
	 <script src="{{ URL::asset('assets/js/app.js') }}"></script>
@show

</body>
</html>

STEP 8

Now, inside scraper.blade.php, add the code below.
If you notice, I have a stylesheets section in this file. If you're wondering why it sits below my content section, don't worry: it will still render at the top of the page, because our master template index.blade.php defines @yield('stylesheets') in the head.

For more information about this templating technique, you can visit the Laravel 4 templates documentation.

@extends('index')

@section('content')
    <div class="container-fluid" style="background-color:#e8e8e8">
        <div class="container container-pad" id="property-listings">

            <div class="row">
              <div class="col-md-12">
                <h1>Superb Web Scraper Demo using Laravel 4  by Developers.ph</h1>
                <p>Web Scraping Contents</p>
              </div>
            </div>

            <div class="row">
                <div class="col-sm-12">
                @if($contents)
					@foreach ($contents as $content)
	                    <!-- Begin Listing: 609 W GRAVERS LN-->
	                    <div class="brdr bgc-fff pad-10 box-shad btm-mrg-20 property-listing">
	                        <div class="media">
	                            <a class="pull-left" href="{{ $content['url'] }}" target="_parent">
	                            <img alt="image" class="img-responsive" src="{{ $content['image_preview'] }}"></a>

	                            <div class="clearfix visible-sm"></div>

	                            <div class="media-body fnt-smaller">
	                                <a href="#" target="_parent"></a>

	                                <h4 class="media-heading">
	                                  <a href="{{ $content['url'] }}" target="_parent">
	                                  	{{ $content['title'] }}
	                                  </a>
	                                </h4>
	                                <p class="hidden-xs">
	                                	 {{ $content['short_description'] }}
	                                </p>
	                                <span class="fnt-smaller fnt-lighter fnt-arial">Author name: {{ $content['author'] }}</span>
	                            </div>
	                        </div>
	                    </div><!-- End Listing-->
	        		@endforeach
	        	@else
	        	<div class="well text-center"> No Result Found!</div>
	        	@endif
                </div>
        </div><!-- End container -->
    </div>
@stop

@section('stylesheets')

	<style type="text/css">
		/**** BASE ****/
		body {
		    color: #888;
		    background-color: #e8e8e8;
		}
		a {
		    color: #03a1d1;
		    text-decoration: none!important;
		}

		/**** LAYOUT ****/
		.list-inline>li {
		    padding: 0 10px 0 0;
		}
		.container-pad {
		    padding: 30px 15px;
		}

		/**** MODULE ****/
		.bgc-fff {
		    background-color: #fff!important;
		}
		.box-shad {
		    -webkit-box-shadow: 1px 1px 0 rgba(0,0,0,.2);
		    box-shadow: 1px 1px 0 rgba(0,0,0,.2);
		}
		.brdr {
		    border: 1px solid #ededed;
		}

		/* Font changes */
		.fnt-smaller {
		    font-size: .9em;
		}
		.fnt-lighter {
		    color: #bbb;
		}

		/* Padding - Margins */
		.pad-10 {
		    padding: 10px!important;
		}
		.mrg-0 {
		    margin: 0!important;
		}
		.btm-mrg-10 {
		    margin-bottom: 10px!important;
		}
		.btm-mrg-20 {
		    margin-bottom: 20px!important;
		}

		/* Color  */
		.clr-535353 {
		    color: #535353;
		}

		/**** MEDIA QUERIES ****/
		@media only screen and (max-width: 991px) {
		    #property-listings .property-listing {
		        padding: 5px!important;
		    }
		    #property-listings .property-listing a {
		        margin: 0;
		    }
		    #property-listings .property-listing .media-body {
		        padding: 10px;
		    }
		}

		@media only screen and (min-width: 992px) {
		    #property-listings .property-listing img {
		        max-width: 180px;
		    }
		}

	</style>

@stop

STEP 9

Now that our views are set up, let's add some real code to our getIndex() method.
For this demo, I will use code.tutsplus.com and get the first page of data. I checked the classes and attributes in the site's HTML so I can target the exact data I need. Please check the code below.

/**
	 * This will be used for Outputting our Data
	 * and Rendering to browser.
	 *
	 * @return void
	 */
	public function getIndex()
	{
		$this->url = 'http://code.tutsplus.com';
		$this->setScrapeUrl( $this->url );

		$this->filters = [
			'title'            => '.posts__post-title',
			'short_description'=> '.posts__post-summary',
			'image_preview'    => '.posts__post-preview img',
			'author' 	   => '.posts__post-author-link'
		];
		return View::make('scraper')->with('contents', $this->getContents());
	}

STEP 10

Use our setScrapeUrl() method to create a request to the $url that we pass from the getIndex() method.

Then set up our getContents() method to handle the results returned by the web scraper.

/**
* Setup our scraper data. Which includes the url that
* we want to scrape
*
* @param (String) $url = default is NULL
*		  (String) $method = Method Types its either POST || GET
* @return void
*/
public function setScrapeUrl($url = NULL, $method = 'GET')
{
		$this->crawler = $this->client->request($method, $url);
		return $this->crawler;
}

/**
* This will get all the return Result from our Web Scraper
*
* @return array
*/
public function getContents()
{
   return $this->content = $this->startScraper();
}

STEP 11

We're almost done. Now let's move on to the most important part of this tutorial: the startScraper() method, which handles all the scraping logic and returns all the data to our getContents() method.

/**
* It will handle all the scraping logic, filtering
* and getting the data from the defined url in our method setScrapeUrl()
*
* @return array
*/
private function startScraper()
{
		// Let's check if our filter has results.
		// I'm using the CssSelector component (jQuery-like selectors) to pick elements.
		$countContent = $this->crawler->filter('.posts__post-title')->count();

		if ($countContent) {
			// loop through in each ".posts--list-large li" to get the data that we need.
		    $this->content = $this->crawler->filter('.posts--list-large li')->each(function(Crawler $node, $i) {
		    	return [
		           		'title' 			=> $node->filter($this->filters['title'])->text(),
		           		'url' 				=> $this->url.$node->filter($this->filters['title'])->attr('href'),
		           		'short_description' => $node->filter($this->filters['short_description'])->text(),
		           		'image_preview' 	=> $node->filter($this->filters['image_preview'])->attr('src'),
		           		'author' 			=> $node->filter($this->filters['author'])->text()
		        ];
		    });
		}
		return $this->content;
}

The Goutte client exposes the filter() method to select HTML elements; it is based on Symfony's DomCrawler, which manipulates HTML/XML documents and makes data extraction easy. The Goutte client can also act like a human: it can simulate clicks, submit forms, and so on.
Right now I'm using CSS selectors, the same as jQuery class selection, but you can also use filterXPath(), which likewise works for HTML and XML DOM traversal.

For example, if you want to select all <p> tags that are direct children of your body tag, you can do it like this:

$crawler = $this->crawler->filter('body > p');
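The same selection can also be written with filterXPath(). Here is a small standalone sketch using Symfony's DomCrawler directly (it assumes the symfony/dom-crawler and symfony/css-selector packages are installed via Composer), just to show that the two selector styles are equivalent:

```php
<?php
// Standalone sketch, outside the controller, using Symfony's DomCrawler.
// Assumes symfony/dom-crawler and symfony/css-selector are installed.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><p>First</p><div><p>Nested</p></div><p>Second</p></body></html>';
$crawler = new Crawler($html);

// CSS selector: <p> tags that are direct children of <body>.
$viaCss = $crawler->filter('body > p')->each(function (Crawler $node) {
    return $node->text();
});

// Equivalent XPath selector.
$viaXPath = $crawler->filterXPath('//body/p')->each(function (Crawler $node) {
    return $node->text();
});

// Both arrays contain only the direct children, "First" and "Second";
// the nested <p> inside the <div> is excluded.
print_r($viaCss);
print_r($viaXPath);
```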

Final Codes

Now we're done. Take a look at the final code for our WebScraperController.php.

<?php
use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

class WebScraperController extends BaseController {

	private $client;
	public  $url;
	public  $crawler;
	public  $filters;
	public  $content = array();

	/**
	 * Defining our Dependency Injection Here.
	 * or Instantiate new Classes here.
	 *
	 * @return void
	 */
	public function __construct(Client $client)
	{
		$this->client 	= $client;
	}

	/**
	 * This will be used for Outputting our Data
	 * and Rendering to browser.
	 *
	 * @return void
	 */
	public function getIndex()
	{
		$this->url = 'http://code.tutsplus.com';
		$this->setScrapeUrl( $this->url );

		$this->filters = [
			'title' 			=> '.posts__post-title',
			'short_description' => '.posts__post-summary',
			'image_preview' 	=> '.posts__post-preview img',
			'author' 			=> '.posts__post-author-link'
		];
		return View::make('scraper')->with('contents', $this->getContents());
	}

	/**
	 * Setup our scraper data. Which includes the url that
	 * we want to scrape
	 *
	 * @param (String) $url = default is NULL
	 *		  (String) $method = Method Types its either POST || GET
	 * @return void
	 */
	public function setScrapeUrl($url = NULL, $method = 'GET')
	{
		$this->crawler = $this->client->request($method, $url);
		return $this->crawler;
	}

	/**
	 * This will get all the return Result from our Web Scraper
	 *
	 * @return array
	 */
	public function getContents()
	{
		return $this->content = $this->startScraper();
	}

	/**
	 * It will handle all the scraping logic, filtering
	 * and getting the data from the defined url in our method setScrapeUrl()
	 *
	 * @return array
	 */
	private function startScraper()
	{
		// Let's check if our filter has results.
		// I'm using the CssSelector component (jQuery-like selectors) to pick elements.
		$countContent = $this->crawler->filter('.posts__post-title')->count();

		if ($countContent) {
			// loop through in each ".posts--list-large li" to get the data that we need.
		    $this->content = $this->crawler->filter('.posts--list-large li')->each(function(Crawler $node, $i) {
		    	return [
		           		'title' 			=> $node->filter($this->filters['title'])->text(),
		           		'url' 				=> $this->url.$node->filter($this->filters['title'])->attr('href'),
		           		'short_description' => $node->filter($this->filters['short_description'])->text(),
		           		'image_preview' 	=> $node->filter($this->filters['image_preview'])->attr('src'),
		           		'author' 			=> $node->filter($this->filters['author'])->text()
		        ];
		    });
		}
		return $this->content;
	}

}

Conclusion

So there you go, we just completed our Superb Web Scraping Tutorial, which you can use in your own projects. You can also use it as a base to come up with new techniques and modifications of the current tutorial.

Feel free to comment below with any questions, clarifications, or objections. :)
I will answer them all. Peace out..


Comments

  1. Christophe Hubert

    Alright, works like a charm…
    Now I'm getting hardcore and scraping a website that has hundreds of links.
    I get a timeout from PHP after 30 seconds of execution (the default).
    Besides increasing the execution time in php.ini, do you have any cleaner way to still output the page but prevent the timeout??

    Cheers !

    • Well, the cleanest way is to process it in the background. If you're running this scraping in the browser, it will hit the PHP timeout or it will make your browser crash..

      If you know how to use the Laravel queue with Beanstalkd, that gives you a fast work queue. I've been doing that all the time…

      Even though it's running in the background, make sure you put a delay between each run. For example, after the first 50 scraped links you should pause before running the next 50, and so on and so forth. That way your IP address will not get banned by that website.. that's just an example.. :)
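      A rough sketch of that batching idea with Laravel 4's queue (the job class name and the 50-link batch size are illustrative, and it assumes a queue driver such as Beanstalkd is configured):

```php
<?php
// Hypothetical queue job: scrape links in batches of 50, re-queueing
// the remaining links with a delay so the target site is not hammered.
class ScrapeLinksJob {

    public function fire($job, $data)
    {
        $links = $data['links'];
        $batch = array_splice($links, 0, 50);   // take the next 50 links

        foreach ($batch as $link) {
            // ... run the scraper for $link here ...
        }

        if (count($links)) {
            // Push the remaining links back onto the queue,
            // delayed by 60 seconds.
            Queue::later(60, 'ScrapeLinksJob', array('links' => $links));
        }

        $job->delete();
    }
}

// Kick it off from a controller or route:
// Queue::push('ScrapeLinksJob', array('links' => $allLinks));
```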

  2. kreker

    Very nice and comprehensive tutorial!
    What about lazy-loaded images? I tried a page with lazy images, where the images only load on scroll events, and the scraper can't scroll… how do I solve that?
    Many thanks

  3. mansoor

    Hey, I coded it exactly as you described but it's not working in Laravel 4.2.
    Please help me fix it. How do I route to this file?
    You didn't mention routing.

    I'm using this:
    Route::controller('/scraper', 'WebScrapperController@getIndex');
    and it says something went wrong.

    • Just check your logs at app/storage/logs and you will find what's wrong, or else just paste the error here so I can help you out.. :)
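      Also, for the routing part: Route::controller() registers a whole controller, not a single method, so you pass only the class name; your getIndex() method then answers GET /scraper automatically. A minimal app/routes.php entry for this tutorial would look like this (assuming the controller is named WebScraperController as in the tutorial):

```php
// app/routes.php
// Registers all get*/post* methods of the controller under the /scraper prefix;
// GET /scraper is then handled by WebScraperController@getIndex.
Route::controller('scraper', 'WebScraperController');
```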

  4. Chris

    I've spent hours trying to get this to work. I too am unsure about the routing; just how should routes.php look with this set-up?
    Does it need to be amended in any way for Laravel 5.2?

    Thanks in advance
