Search Engine Scraper source code

Project offered by compunect [scraping@compunect.com]. Last successful test run: 28 Jan 2016.

This advanced PHP source code was developed to power scraping-based projects. While the code can already be used from the console (or a browser), it is mainly intended as a base for customization. You can either customize this project yourself or hire us to do what we do best. compunect is an IT services and development company, founded in Germany and now located in the Czech Republic, focused on professional customers.

This free Search Engine Scraper already includes:

This scraper can operate 24 hours a day, 7 days a week without getting blocked
Full support for the Google search engine
Scraping a list of keywords
Detection of organic results
Iterating through multiple result pages (configurable)
Scraping accurate global results, and also targeted local results (by country), when using highest-quality US IP addresses
Support for Google filters (configurable)
Proper IP management: it can use our IP service API to acquire IP addresses automatically
Proper delays between requests to prevent getting banned
Data cache and IP history to prevent unnecessary use and overuse of IP addresses
Accessible source code design to make customization easier
Perfectly suitable as background process in Linux environments

Scraping search engines has become a serious business in recent years, and it remains a very challenging task. We know how difficult it can be to find an experienced developer in this area, and detailed information is hard to find online. Providing this source code for free was quite a step for us, as it contains rare knowledge and we know of nothing comparable. You may use this source code in your commercial project without paying us a cent.

However, if you require customization or additional features, we offer such services; after all, who else could do it better? If you require a professionally managed Linux server to run your projects on, we can help you set this up at a fair rate. You will definitely need high-quality, dedicated IP addresses to power your project. We offer these as well and would be glad to find a solution for you.

If you are interested in scraping projects, check out the Google Suggest Scraping Spider as well. The Suggest Scraper can generate thousands of organically relevant search terms to be scraped.

More to know about scraping

It took us months of testing and development to get accurate results from Google when using automated scripts. This source code already includes most of that work. We even included the ability to gather local search results, so you can scrape results for any country without using IP addresses from that country. However, to receive correct results you will also need exceptionally good IP addresses. We can provide these if you struggle to obtain them on your own. Extending the source to work with Bing, Yahoo or another search engine should not be a big leap, as many of the core functions will stay similar.
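As a rough illustration of how result pages and country targeting can be expressed in a request, the sketch below builds a search URL from a keyword. The function name buildSearchUrl and its defaults are hypothetical and not taken from this project's source. The q, start, hl and gl query parameters are the ones visible in ordinary Google search URLs; whether the scraper uses exactly this combination is an assumption.

```php
<?php
// Hypothetical helper for illustration only; not part of the project source.
// $page is zero-based, $perPage is the assumed number of results per page.
function buildSearchUrl(string $keyword, int $page = 0, string $country = 'us', string $language = 'en', int $perPage = 10): string
{
    $params = [
        'q'     => $keyword,          // search phrase
        'start' => $page * $perPage,  // offset of the first result on this page
        'hl'    => $language,         // interface language
        'gl'    => $country,          // country used for localized results
    ];

    // http_build_query() URL-encodes the values (spaces become "+").
    return 'https://www.google.com/search?' . http_build_query($params);
}

// Second result page for "Scraping PHP", localized for Germany.
echo buildSearchUrl('Scraping PHP', 1, 'de'), PHP_EOL;
```

Keeping the URL construction in one small function makes it easier to swap in another search engine later, since only the base URL and parameter names would change.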

What to do with this tool?

There are countless interesting uses for this scraper. Do you invest in Google AdWords to have your websites ranked for competitive search terms? Then you likely struggle with the thousands of keywords Google wants you to invest money in: which ones should you choose, and which ones are a waste of money? Imagine being able to check your website's rank for thousands of keywords and key phrases and only pay for those where your website is not ranked well enough. You can even automate the whole process, using the AdWords API to pay according to your organic rank per keyword and updating this monthly. And on top of Google's own suggestions, maybe there are hundreds of organically relevant key phrases you do not even know about? Use the Google Suggest Scraping Spider to find what people are really looking for, then use this Google Search Scraper to find out whether you are ranked already.

Are you optimizing your websites for Google, or are you in the SEO business optimizing for your customers? Track thousands of websites and keywords to see where you have to invest work. That way you can also track the effectiveness of your various methods for improving rank. Or go one step further and offer your customers a graph for all their websites and keywords that shows how your work has influenced their ranks. Or go even further and analyze the ranks of hundreds of thousands of companies worldwide. You can use our Google Finance Scraping Spider to get all the companies out of Google Finance. You may also make the whole project interactive for users and let them get ranks or charts for their own keywords and websites.

Of course, this project can also be used to simply gather massive amounts of URLs and titles for a set of keywords. By doing regular scrape runs and storing the results in a database with a timestamp, you can unleash the real power of this project. If you need help developing such extensions, I am available for hire.

IP/Proxy management

When scraping, it is essential to avoid detection. Google would ban any user who tries to automatically scrape its search engine results. In the worst case it can issue a ban that permanently blocks tens of thousands of IP addresses. This is usually all that happens: it threatens the project but not the legal entity behind it. However, there is also a legal risk. If you have not accepted the search engine's TOS, passively scraping it should not expose you to legal threats. To be sure about that, you need to consult your local lawyer.

In any case, it is possible to avoid detection. The free Search Engine Scraper on this website can be used long term without being detected:

a) It will send Google requests at a rate of 10 requests per hour per IP address.
b) It will calculate a proper delay between each request.
c) It will not accept any tracking offered by Google.
d) It will rotate the IP address at the correct moments.
e) It will keep a local data cache and IP history.
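The arithmetic behind the request rate can be sketched as follows. This is a minimal illustration, not the project's actual delay logic: the function names, the even rotation over all addresses and the random jitter are assumptions. Only the figure of 10 requests per hour per IP address comes from the text above.

```php
<?php
// Hypothetical sketch; names, even rotation and jitter are assumptions.
const REQUESTS_PER_HOUR_PER_IP = 10; // rate stated in the text

// Minimum number of seconds between two requests sent through the same IP.
function perIpDelaySeconds(int $requestsPerHour = REQUESTS_PER_HOUR_PER_IP): float
{
    return 3600 / $requestsPerHour;
}

// Seconds between consecutive requests when rotating evenly over $ipCount IPs.
function rotationDelaySeconds(int $ipCount, int $requestsPerHour = REQUESTS_PER_HOUR_PER_IP): float
{
    if ($ipCount < 1) {
        throw new InvalidArgumentException('At least one IP address is required.');
    }
    return perIpDelaySeconds($requestsPerHour) / $ipCount;
}

// Lengthen the delay by a random fraction (0 .. $maxJitter) so requests are
// not evenly spaced. The delay is only ever extended, so the per-IP rate
// stays at or below the configured limit.
function jitteredDelaySeconds(float $baseDelay, float $maxJitter = 0.2): float
{
    $extra = mt_rand(0, 1000) / 1000 * $maxJitter;
    return $baseDelay * (1 + $extra);
}
```

A single address may be used once every 360 seconds at this rate. With 36 addresses in even rotation, that works out to one request every 10 seconds overall before jitter is added.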

Google captcha blocks automated access

If you follow these guidelines, a captcha block caused by your own actions is very unlikely. When using a different IP/proxy service, the cause is most likely shared IP usage or previous abuse. The Google Search Scraper offered here already contains code to detect a block and abort in that case. Google issues several typical error messages when it decides to block or slow down activity. Here are two examples:

We're sorry... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now. We'll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software. If you're continually receiving this error, you may be able to resolve the problem by deleting your Google cookie and revisiting Google. For browser-specific instructions, please consult your browser's online support center. If your entire network is affected, more information is available in the Google Web Search Help Center. We apologize for the inconvenience, and hope we'll see you again on Google.

To continue searching, please type the characters you see below:

Often a captcha is offered to continue searching. In the worst case, Google completely blocks all access to one or all services for one or multiple IPs. If you stick to the peak rates and use IPs from us-proxies.com, it is unlikely you will run into this problem.
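Detecting such a block page can be sketched as a simple text check on the response body. The marker phrases below are taken from the example messages above. The function name detectGoogleBlock and its return values are hypothetical, and the detection code shipped with the project may work differently.

```php
<?php
// Hypothetical sketch; only the marker phrases come from the messages quoted
// in this document. Returns 'captcha', 'automated' or null (no block found).
function detectGoogleBlock(string $html): ?string
{
    $markers = [
        'captcha'   => 'please type the characters you see below',
        'automated' => 'your query looks similar to automated requests',
    ];

    // Decode entities such as &#39; and compare case-insensitively.
    $haystack = strtolower(html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8'));

    foreach ($markers as $type => $phrase) {
        if (strpos($haystack, $phrase) !== false) {
            return $type;
        }
    }
    return null;
}

// Usage: stop the run instead of retrying through the same IP address.
$body = '<p>To continue searching, please type the characters you see below:</p>';
if (detectGoogleBlock($body) !== null) {
    fwrite(STDERR, "Block detected, aborting this IP address.\n");
}
```

Aborting on the first detected block, rather than retrying, avoids sending further requests through an address that Google has already flagged.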

US-Proxy support

This project runs through a US proxy service. Powered by the supplied API, it can scrape millions of results without getting blocked. The benefit of using us-proxies.com is an easily extendable IP service providing the best IP quality in the industry at a fair price, aimed at professionals. However, the code is not limited to this particular service. You are free to adapt the source to suit your needs.

Google Search Scraper PHP code

The source code is written in PHP and is ready to be used immediately. You can either make an agreement with us-proxies for IP addresses or replace the relevant parts and use your own IP solution. Before using the source code please read the license agreement.

Example output

Here is an example result-set from a test-run:

Keyword: Scraping PHP

!Ranking information for keyword "Scraping PHP" !
!Rank [Type] - Website -  Title!
[organic] - http://stackoverflow.com/questions/34120/html-scraping-in-php - HTML Scraping in Php - Stack Overflow 
[organic] - http://www.oooff.com/php-scripts/basic-php-scrape-tutorial/basic-php-scraping.php - Basic PHP Web Scraping Script Tutorial - Oooff.com 
[organic] - http://anchetawern.github.io/blog/2013/08/07/getting-started-with-web-scraping-in-php - Getting Started with Web Scraping in PHP - Wern Ancheta 
[organic] - http://simplehtmldom.sourceforge.net/ - PHP Simple HTML DOM Parser 
[organic] - http://www.jacobward.co.uk/web-scraping-with-php-curl-part-1/ - Web Scraping With PHP & CURL [Part 1] | Jacob WardJacob Ward 
[organic] - https://github.com/fabpot/goutte - fabpot/Goutte · GitHub
[organic] - http://www.instructables.com/id/Beginning-web-page-scraping-with-php/ - Beginning web page scraping with php. - Instructables 
[organic] - https://www.phparch.com/books/phparchitects-guide-to-web-scraping-with-php/ - php|architect's Guide to Web Scraping with PHP « php[architect ...
[organic] - http://scraping.pro/scraping-in-php-with-curl/ - Scraping in PHP with cURL - Web Scraping 
[organic] - http://www.ymc.ch/en/webscraping-in-php-with-guzzle-http-and-symfony-domcrawler - Webscraping in PHP with Guzzle HTTP and Symfony DomCrawler ... 
[organic] - https://code.tutsplus.com/tutorials/html-parsing-and-screen-scraping-with-the-simple-html-dom-library--net-11856 - HTML Parsing and Screen Scraping with the Simple HTML DOM ... 
[organic] - http://www.eppie.net/simple-php-scraper-class/ - Simple PHP Scraper Class | - Eppie.net 
[organic] - http://jacerdass.wordpress.com/2013/07/17/web-scrapping-done-right-using-php/ - Web scraping done right using PHP | Jacer Omri's Blog 
[organic] - http://www.youtube.com/watch?v=632ql93H90g - Scraping Websites with PHP using DOMDocument and DOMXpath ... 
[organic] - http://www.youtube.com/watch?v=Uv4eASStpas - PHP web scraping tutorial 1 : Automated Registration Form - YouTube 
[organic] - http://www.devhour.net/filling-out-forms-with-php-and-curl/ - Scraping data with PHP and cURL Devhour 
[organic] - http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial - Easy web scraping with PHP - The Future of the Web » Articles »
[organic] - http://www.devblog.co/php-web-page-scraping-tutorial/ - PHP Web Page Scraping Tutorial | DevBlog.co 
[organic] - http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/ - 6 tools for scraping - Use for datajournalism & insightful content 
[organic] - http://www.packtpub.com/web-scraping-with-php/book - Instant PHP Web Scraping [Instant] | Packt Publishing 
[organic] - http://showmethecode.es/php/php-goutte-una-libreria-para-hacer-web-scraping/ - PHP: Goutte una librería para hacer web scraping - Show me the code
[organic] - http://www.sitepoint.com/image-scraping-symfonys-domcrawler/ - Image Scraping with Symfony's DomCrawler - SitePoint 
[organic] - http://www.amazon.com/Instant-PHP-Scraping-Jacob-Ward-ebook/dp/B00E7NC9CS - Amazon.com: Instant PHP Web Scraping eBook: Jacob Ward: Kindle ... 
[organic] - http://code.google.com/p/universal-web-scraper/ - Universal Web Scraper - Google Code 
[organic] - http://hackaday.com/2012/12/10/web-scraping-tutorial/ - Web scraping tutorial - Hack a Day 
[organic] - http://www.tdbowman.com/?p=426 - Web Scraping Using PHP and jQuery | Managing My Impression 
[organic] - https://barebonescms.com/documentation/ultimate_web_scraper_toolkit/ - Ultimate Web Scraper Toolkit Documentation - Barebones CMS 
[organic] - http://www.phpclasses.org/package/1754-PHP-Extract-structured-data-from-remote-HTML-pages.html - PHP Scraper: Extract structured data from remote HTML pages ... 
[organic] - http://imbuzu.wordpress.com/2013/06/26/web-scraping-with-php/ - Web Scraping with PHP | Buzu's Oficial Blog 
[organic] - http://developer.yahoo.com/yql/guide/yql-code-examples.html - YQL Code Examples - YDN 
[organic] - http://blog.wlindley.com/2013/07/easy-screen-scraping-in-php/ - Easy screen scraping in PHP | A journal of my take on this wacky world 
[organic] - http://www.mozenda.com/php-screen-scrape - PHP Screen Scrape Software Program Tool set - Mozenda 
[organic] - http://acrl.ala.org/techconnect/?p=3850 - Web Scraping: Creating APIs Where There Were None ACRL ... 
[organic] - http://www.matthewwatts.net/tutorials/php-tutorial-2-advanced-data-scraping-using-curl-and-xpath/ - PHP Tutorial 2: Advanced Data Scraping Using cURL And XPATH ... 
[organic] - http://www.martinhurford.com/screen-scraping-with-php-querypath.html - Screen Scraping with PHP and QueryPath - Martin Hurford 
[organic] - http://www.webmasterworld.com/php/4652704.htm - Web scraping PHP Server Side Scripting forum at WebmasterWorld 
[organic] - https://leanpub.com/web-scraping - Web Scraping for PHP… by sameer borate [Leanpub PDF/iPad/Kindle]
[organic] - http://www.akshitsethi.me/parsing-web-pages-in-php/ - Parsing web pages in PHP | Akshit Sethi 
[organic] - http://google-scraper.squabbel.com/ - Scraping Google for Fun and Profit 
[organic] - http://blog.cnizz.com/2012/10/12/scrape-faster-with-php-domdocument-and-safely-with-tor/ - Scrape Faster with PHP DomDocument and Safely with Tor | Chris ... 
[organic] - https://classic.scraperwiki.com/docs/php/php_intro_tutorial/ - Documentation / First scraper tutorial | ScraperWiki 
[organic] - http://snipplr.com/view/22188/ - Easy scraping and HTML parsing with PHP5 and XPath - PHP ... 
[organic] - http://lab.abhinayrathore.com/imdb/ - Free PHP ASP.net C# VB.net IMDb Scraper API and Web Service ... 
[organic] - http://www.lie-nielsen.com/scraping-planes/large-scraping-plane/ - Large Scraping Plane - Lie-Nielsen Toolworks 
[organic] - http://saturnboy.com/2010/03/scraping-google-groups/ - Scraping Google Groups « Saturnboy
[organic] - http://www.amazon.co.uk/Instant-PHP-Scraping-Jacob-Ward/dp/1782164766 - Instant PHP Web Scraping: Amazon.co.uk: Jacob Ward: Books 
[organic] - http://sledgedev.com/build-a-scraper-with-php/ - Sledge Dev – Build a scraper with php
[organic] - http://www.maxprog.com/forum/viewtopic.php?f=11 - Maxprog Forum • View topic - scraping php sites?
[organic] - http://books.google.com/books?id=Q-cEMrCWckkC - Instant PHP Web Scraping - Google Books Result 
[organic] - https://www.odesk.com/o/profiles/users/_~01d067ffb7cb06ee0e/ - Sandip Debnath - Proxy&Login-Bots/Scraping/Php/Regex/Ai/Ajax ... 
[organic] - http://saf33r.com/web-scraping-101-with-php-and-goutte - Web Scraping 101 with PHP and Goutte | Safeer 
[organic] - http://www.redscraper.com/blog/basic-of-web-scraping-using-php/ - Basic of Web Scraping Using PHP | Redscraper Blog 
[organic] - http://skybluesofa.com/blog/how-use-phps-domdocument-scrape-web-page/ - How to Use PHP's DOMDocument to Scrape a Web Page - Sky Blue ... 
[organic] - http://webdata-scraping.com/data-scraping-pdf-files-using-php/ - How to do data scraping from PDF files using PHP? | WebData ... 
[organic] - http://www.russellbeattie.com/blog/using-php-to-scrape-web-sites-as-feeds - Using PHP to scrape web sites as feeds - Russell Beattie 
[organic] - http://www.slideshare.net/tobias382/web-scraping-with-php-presentation - Web Scraping with PHP - SlideShare 
[organic] - http://www.hochmanconsultants.com/articles/stop-email-spam.shtml - Code to Prevent Email Address Scraping and Form Spam via PHP ... 
[organic] - http://www.developertutorials.com/tutorials/php/easy-screen-scraping-in-php-simple-html-dom-library-simplehtmldom-398/ - Easy Screen Scraping in PHP with the Simple HTML DOM Library 
[organic] - http://thinkdiff.net/php/php-for-web-scraping-and-bot-development/ - PHP for Web scraping and bot development | Thinkdiff.net 
[organic] - http://rojan.com.np/scraping-nodejs-vs-php/ - Rojan's blog | Scraping – Nodejs Vs Php
[organic] - http://www.barattalo.it/2013/12/08/php-jquery-dom-navigating-scrape-spider/ - Scraping content with PHP as if was jQuery, PHP jQuery like methods 
[organic] - http://www.webdeveloper.com/forum/showthread.php?230985-Blocking-php-curl-from-scraping-website-content - Blocking php curl from scraping website content - WebDeveloper.com 
[organic] - http://www.reddit.com/r/PHP/comments/1xiygj/what_is_the_best_php_library_for_scraping/ - What is the best php library for scraping websites, and filling out ... 
[organic] - http://neerajpro.wordpress.com/2013/09/16/web-scraping-and-bot-development-using-php/ - web scraping and bot development using PHP | OPEN LEARNING 
[organic] - http://wiki.vuze.com/w/Scrape - Scrape - VuzeWiki 
[organic] - http://papermashup.com/use-jquery-and-php-to-scrape-page-content/ - Use jQuery and PHP to scrape page content | Papermashup.com 
[organic] - http://ctrlq.org/code/19064-web-scraping-amazon - Web Scraping Amazon with PHP | The Programmer's Library 
[organic] - https://www.facebook.com/apps/site_scraping_tos_terms.php - Automated Data Collection Terms - Facebook 
[organic] - http://tyler.io/2008/05/scraping-imdb-with-php/ - Scraping IMDB With PHP | tyler.io 
[organic] - http://www.coderanch.com/t/549196/PHP/Solved-Regular-Expressions-Scraping - [Solved] Help With Regular Expressions/Scraping (PHP forum at ... 
[organic] - http://web3o.blogspot.com/2010/10/php-imdb-scraper-for-new-imdb-template.html - FREE! PHP IMDb Scraper/API for new IMDb Template 
[organic] - http://superuser.my/web-scraping-ganon-php/ - Web Scraping Using Ganon PHP Library | superuser.my 
[organic] - http://www.hmp.is.it/scraping-a-site-with-php/ - Simple way of scraping a website using PHP - hmp.is.it 
[organic] - http://www.scriptrr.com/ - Website Scraper | Forum Crawler | Screen Scrapping | Data Mining ... 
[organic] - http://scraperblog.blogspot.com/2013/07/php-scrape-website-with-rotating-proxies.html - ScraperBlog: Php - scrape website with rotating proxies 
[organic] - http://wiki.xbmc.org/index.php?title=Naming_video_files/TV_shows - Naming video files/TV shows - XBMC 
[organic] - https://packagist.org/search/?tags=scraper - Scraper - Packagist 
[organic] - http://codedit.com/php/web-scraping-with-php-curl - Codedit.com | Web Scraping with PHP & CURL 
[organic] - http://www.screen-scraper.com/products/all.php - Web scraping software | screen-scraper.com 
[organic] - http://www.warriorforum.com/programming-talk/530802-scraping-websites-use-php-regexp-something-else.html - Scraping websites - use PHP and Regexp or something else ... 
[organic] - http://codeatomic.com/services/web-scraping/ - Code Atomic Web scraping php (web harvesting or web data ... 
[organic] - http://jon.netdork.net/2011/02/21/nagios-web-scraping-and-php-as-an-agent - Nagios, web scraping, and PHP as an agent - TheGeekery 
[organic] - http://www.scrapegoat.com/faqs.php - FAQs Page - Data Mining and Screen Scraping from ScrapeGoat.com 
[organic] - http://www.indeed.com/q-PHP-Scraping-jobs.html - PHP Scraping Jobs, Employment | Indeed.com 
[organic] - https://forums.digitalpoint.com/threads/php-screen-scraping-specific-data.2680501/ - PHP screen scraping specific data - Digital Point Forums 
[organic] - http://codereview.stackexchange.com/questions/40538/why-is-my-web-scraping-script-so-slow - php - Why is my web scraping script so slow? - Code Review Stack ... 
[organic] - http://www.freelancer.com/jobs/Web-Scraping/ - Web Scraping Jobs and Contests | Freelancer.com 
[organic] - http://www.fiverr.com/systemexpert/code-a-php-scraper-that-will-scrape-5-items-from-a-website-of-your-choice - code a php scraper that will scrape 5 items from a website of your choi 
[organic] - http://forums.macrumors.com/showthread.php?t=1689584 - Setting up a web scraping system - MacRumors Forums 
[organic] - http://www.4shared.com/office/CC-9NLJn/php_architects_guide_to_web_sc.html - php architect's guide to web scraping with php - Download - 4shared 
[organic] - http://raphaelstolt.blogspot.com/2008/10/scraping-websites-with-zenddomquery.html - : Scraping websites with Zend_Dom_Query 
[organic] - http://www.codefire.org/blogs/item/data-scraping-using-curl-in-php.html - Data scraping using cURL in PHP - CodeFire 
[organic] - http://matthewturland.com/2010/04/20/web-scraping-with-php-now-available/ - Matthew Turland » Blog Archive » “Web Scraping with PHP” Now ...
[organic] - http://www.xmarks.com/site/www.bradino.com/php/screen-scraping/ - PHP Screen Scraping Tutorial - Xmarks 
[organic] - http://www.dmxzone.com/go/4402/page-scraping/ - Page Scraping - Articles - DMXzone.COM 
[organic] - http://blog.makewebsmart.com/scraping-library-for-codeigniter-framework/136 - Scraping library for CodeIgniter Framework | MakeWebSmart 
[organic] - http://www.phpninja.info/blog/2013/08/crawling-scraping-app-store-andor-android-market/ - Crawling and Scraping App Store and/or Android Market - Php Ninja 
[organic] - http://www.phpdeveloper.org/tag/scraping - scraping - PHPDeveloper: PHP News, Views and Community 
[organic] - http://www.phpbuilder.com/columns/marc_plotz011410.php3 - PHPBuilder - Build a PHP Link Scraper with cURL 
[organic] - http://www.archiveteam.org/index.php?title=URLTeam - URLTeam - Archiveteam 
[organic] - http://devzone.zend.com/1087/php-abstract-episode-22-screen-scraping/ - PHP Abstract Episode 22: Screen Scraping | Zend Developer Zone 
[organic] - http://www.ngo-hung.com/blog/2012/11/03/list-of-open-source-screen-scraping-tools - List of open source screen scraping tools - Ngo The Hung's blog 
[organic] - http://entropytc.com/screen-scraping-with-php/ - Screen scraping with PHP - Entropy Technical Consulting 
[organic] - http://www.fromzerotoseo.com/scraping-websites-php-curl-proxy/ - Scraping websites with PHP cURL under proxy | From Zero To SEO 
[organic] - http://www.yiiframework.com/extension/yiiscrapermodule/ - yiiscrapermodule | Extension | Yii PHP Framework 
[organic] - https://docs.google.com/document/d/18Q2THQvYCG2_n6nKVsZRHlaPG9iJ9NvLezOOQbEuAJs/edit?hl=en - Tipsheet: Web Scraping for Non-Programmers - Google Drive 
[organic] - http://www.digeratimarketing.co.uk/2008/12/16/curl-page-scraping-script/ - CURL Page Scraping Script - Digerati Marketing 
[organic] - http://www.shekhargovindarajan.com/scripts/web-scraping-with-firefox-and-php-using-xpath/ - Web Scraping with Firefox and PHP, using XPath | Shekhar ... 
[organic] - http://www.quickscrape.com/ - QuickScrape | Quick php html scraper and crawler for scraping and ... 
[organic] - http://www.linkedin.com/groups/Php-Web-Html-Content-Scraping-4818098 - Php Web Html Content Scraping Help | LinkedIn 
[organic] - http://forum.codecall.net/topic/77005-scraping-charts-from-this-website/ - Scraping charts from this website? - PHP - Codecall 
[organic] - https://www.elance.com/r/contractors/q-PHP%20cURL%20Data%20Scraping - Find PHP cURL Data Scraping Freelancers & Contractors 
[organic] - http://php.dzone.com/news/gotcha-scraping-net - Gotcha on Scraping .NET Applications with PHP and cURL | PHP ... 
[organic] - https://itunes.apple.com/us/book/instant-php-web-scraping/id680880119?mt=11 - iTunes - Books - Instant PHP Web Scraping by Jacob Ward 
[organic] - http://www.zacharydavidbiles.com/2012/05/scraping-pinterest-with-php/ - Scraping Pinterest with PHP | Zach Biles – Cartersville, GA Web ...
[organic] - http://www.ebook3000.com/php-architect-s-Guide-to-Web-Scraping-with-PHP_113893.html - php|architect's Guide to Web Scraping with PHP - Free eBooks ... 
[organic] - http://www.weblee.co.uk/2009/06/18/simple-dom-helper-for-codeigniter/ - Simple Dom Helper codeigniter | Screen Scraping | PHP ... - Web Lee 
[organic] - http://www.nicolasmarin.com/web-scraper-con-php/ - Web scraper con PHP | Nicolás Marín
[organic] - http://www.quora.com/Web-Scraping/How-do-you-scrape-asp-or-php-pages - Web Scraping: How do you scrape .asp or .php pages? - Quora 
[organic] - http://www.urbandictionary.com/define.php?term=scraper - Urban Dictionary: scraper 
[organic] - http://forums.phpfreaks.com/topic/276972-scraping-the-data-from-website/ - scraping the data from website - PHP Coding Help - PHP Freaks 
[organic] - http://www.h-net.org/reviews/showrev.php?id=37101 - H-Net Reviews 
[organic] - http://www.connotate.com/technology/product - Automated Web Data Collection | Intelligent Web Scraping | Hosted ... 
[organic] - http://phptrends.com/dig_in/scraping - scraping - PHP Trends, libraries and frameworks 
[organic] - http://www.tonido.com/blog/index.php/2013/12/28/web-scraping-and-legal-issues/ - Web Scraping and Legal Issues - Tonido 
[organic] - http://elanmarikit.me/2011/03/scraping-aspnet-page-in-php-curl.html - Scraping ASP.NET page in PHP Curl | PHP/Web Development 
[organic] - http://www.r-bloggers.com/scraping-table-from-any-web-page-with-r-or-cloudstat/ - Scraping table from any web page with R or CloudStat | (R news ... 
[organic] - http://www.peopleperhour.com/freelance/web+scraping+php+curl - Web scraping php curl - PeoplePerHour.com 
[organic] - http://dayat.net/introduction-to-scraping-techniques/ - Introduction To Scraping Techniques | Dayat Technologies 
[organic] - http://robertbasic.com/blog/book-review-guide-to-web-scraping-with-php - Book review - Guide to Web Scraping with PHP ~ Robert Basic ~ the ... 
[organic] - http://forums.whirlpool.net.au/archive/1983474 - Running a PHP scraping script - Programming - Whirlpool Forums 
[organic] - http://www.adminspoint.com/programming/296-easy-screen-scraping-php-server-side-scripting-language-simple-html-dom-library.html - Easy Screen Scraping in PHP with the Simple HTML DOM Library 
[organic] - http://www.hotscripts.com/forums/php/114448-data-scraping-question.html - Data Scraping Question - Hot Scripts Forums 
[organic] - http://www.pearltrees.com/mic100/php-scraping/id4775553 - Php scraping | Pearltrees 
[organic] - http://hublog.hubmed.org/archives/001558.html - HubLog: Scraping web pages with PHP 5 
[organic] - http://blog.hartleybrody.com/web-scraping/ - I Don't Need No Stinking API: Web Scraping For Fun and Profit 
[organic] - http://www.blackhatworld.com/blackhat-seo/black-hat-seo/565471-dev-php-crawler-scraping-video-sites.html - [DEV] PHP crawler for scraping video sites - Black Hat World 
[organic] - http://deepinthecode.com/2014/02/28/scraping-div-element-web-page-php/ - Scraping a DIV Element from a Web Page with PHP – Deep in the ...
[organic] - http://ao2.it/en/blog/2013/07/07/tweeper-twitter-rss-web-scraper - Tweeper: a Twitter to RSS web scraper | en hacking | ao2.it 
[organic] - http://bz9.com/index.php/youtube-scraper/ - YouTuber :: YouTube Scraper - BZ9.com 
[organic] - https://phpacademy.org/topics/html-web-scraping-with-php/33032 - HTML Web Scraping with PHP | phpacademy 
[organic] - http://blogoscoped.com/archive/2004_06_23_index.html - Screen-scraping With PHP5 | Googlebot Alert | Gmail Hype Ending ... 
[organic] - http://superuser.com/questions/179253/how-legal-is-site-scraping-using-curl - php - How "legal" is site-scraping using cURL? - Super User 
[organic] - http://osdir.com/ml/org.user-groups.php.uphpu/2008-09/msg00075.html - org.user-groups.php.uphpu - Web site scraping - msg#00075 ... 
[organic] - http://my.safaribooksonline.com/book/programming/php/9781782164760/1dot-instant-php-web-scraping/ch01s09_html - Instant PHP Web Scraping > 1. Instant PHP Web Scraping ... 
[organic] - https://discussion.dreamhost.com/thread-125593.html - php curl screen scraping program needs an if fork - DreamHost Forum 
[organic] - http://www.daniweb.com/web-development/php/threads/289020/blocking-php-curl-from-scraping-website-content - Blocking php curl from scraping website content | DaniWeb 
[organic] - http://leandroarts.com/how-to-scrape-google-search-results-for-query-popularity-with-php/ - How to scrape Google search results for query popularity with PHP ... 
[organic] - http://jimblackler.net/blog/?p=13 - Jim Blackler · Scraping text from Wikipedia using PHP
[organic] - http://www.mishainthecloud.com/2009/12/screen-scraping-aspnet-application-in.html - Misha in the Cloud: Screen-scraping an ASP.NET application in PHP 
[organic] - http://ehelion.net/projects/htmlscrape/scrape.html - Collecting data using HTML scraping - ehelion.com 
[organic] - http://www.wellho.net/resources/ex.php4?item=h307/scraper.php - Scraping a remote URL content - PHP example 
[organic] - http://horusss2.wordpress.com/2009/12/05/use-php-dom-parser-for-more-robust-screen-scraping/ - Use PHP DOM Parser for more robust screen scraping | THIS BLOG ... 
[organic] - http://www.amitsamtani.com/2010/03/30/web-scraping-using-php-and-xpath/ - Web Scraping using PHP and XPath - amitsamtani.com 
[organic] - http://99webtools.com/extract-website-data.php - Extract website data using php - Web tools 
[organic] - http://www.iwebscraping.com/Web_Scraping_Service.php - Web Scraping Service | Web Data Scraping | Website Scraping 
[organic] - http://www.windbusinessfactor.it/storage/video/1309/-php-architects--guide-to-web-scraping-with-php.pdf - php|architect's Guide to Web Scraping with PHP - Wind Business ... 
[organic] - http://www.computerhope.com/forum/index.php?topic=129466.0 - PHP cURL (Scraping a website) - Computer Hope 
[organic] - http://scrapedefender.com/education/web-scraping-job-listings/ - Data and Web Scraping Job Listings | Scrape Defender 
[organic] - http://wordpress.org/plugins/wp-web-scrapper/other_notes/ - WordPress › WP Web Scraper « WordPress Plugins
[organic] - http://phpcircle.net/content/website-scraping-advantages-php - Website Scraping Advantages With PHP !! | PHPCircle 
[organic] - http://devtrench.com/posts/screen-scrape-with-php-curl - Screen Scraping: How to Screen Scrape a Website with PHP and ... 
[organic] - http://forums.devshed.com/php-development-5/scraping-aspx-site-php-799426.html - Scraping an aspx site with php - Dev Shed Forums 
[organic] - http://www.internetnews.com/ec-news/article.php/3334651 - Google Moves to Block RSS Scraping - InternetNews. 
[organic] - http://softadvice.informer.com/Php_Email_Scraper.html - Php Email Scraper - free download suggestions - Software Advice 
[organic] - http://sourabhjainblog.wordpress.com/2013/11/13/scraping-websites-with-php-curl-under-proxy/ - Scraping websites with PHP cURL under proxy | Sourabh Jain - php ... 
[organic] - http://nbviewer.ipython.org/url/www.unc.edu/~ncaren/Lax-1.ipynb.json - Web scraping in Python - IPython Notebook Viewer 
[organic] - http://scrollingtext.org/using-curl-and-user-agent-string-web-scraping-pt-2-now-php - Using curl and a user agent string for web scraping pt 2; Now with PHP 
[organic] - http://blog.amhill.net/2010/09/17/scraping-twitpics-with-php-coding/ - Scraping Twitpics with PHP [Coding] | Blog.amhill 
[organic] - http://corgitoergosum.net/2011/01/17/replicating-flipboard-part-i-site-scraping/ - Replicating Flipboard Part I – Site Scraping | Cogito Ergo Sum
[organic] - http://www.earthinfo.org/xpaths-with-php-by-example/ - XPaths with PHP by example « Earth Info
[organic] - https://trac.transmissionbt.com/ticket/4158 - (scraping trackers of form "announce.php?key ... - Transmission 
[organic] - http://harmssite.com/2012/01/scraping-a-page-with-php - Scraping a page with php - HarmsSite 
[organic] - http://bytes.com/topic/php/answers/889713-blocking-php-curl-scraping-website-content - Blocking php curl from scraping website content - PHP - Bytes 
[organic] - http://blog.digitalmethods.net/2010/asimpletwitterscraper/ - A simple Twitter scraper - Digital Methods Initiative 
[organic] - http://www.satya-weblog.com/2010/11/play-with-yql-html-scraping-using-yql-and-php.html - Play with YQL: HTML Scraping using YQL and PHP - Satya's Weblog 
[organic] - https://www.e-education.psu.edu/geog863/l6_p6.html - Web Scraping | GEOG 863: Mashups - e-Education Institute 
[organic] - http://php.find-info.ru/php/010/phphks-CHP-5-SECT-12.html - PHP: Hack 44. Scrape Web Pages for Data 
[organic] - https://support.startpage.com/index.php?/Knowledgebase/Article/View/188/23/how-does-startpage-prevent-scraping-and-abuse-without-recording-ip-addresses - How does StartPage prevent scraping and abuse without recording ... 
[organic] - http://www.seerinteractive.com/blog/scraping-for-dummies-with-outwit-a-marketers-best-friend - Scraping for Dummies with Outwit (a Marketer's Best Friend) | SEER ... 
[organic] - http://health.mo.gov/lab/scabies.php - Skin Scraping Exam | State Public Health Laboratory | Health ... 
[organic] - http://junseewebdesigner.wordpress.com/2013/08/05/php-scrape-a-wordpress-feed/ - PHP Scrape a WordPress Feed | Junsee 
[organic] - http://blog.matthewdfuller.com/2012/07/defeating-x-frame-options-with-scraping.html - Matthew D Fuller - Blog: Defeating X-Frame-Options with Scraping 
[organic] - http://www.garysieling.com/blog/scraping-google-maps-search-results-with-javascript-and-php - Scraping Google Maps Search Results With Javascript And PHP ... 
[organic] - http://tellini.info/2011/05/scraping-mac-app-store-reviews/ - Scraping Mac App Store reviews | Simone Tellini 
[organic] - http://forums.thedailywtf.com/forums/p/8578/162940.aspx - Lame PHP Screen Scraping - TDWTF Forums 
[organic] - http://www.tutorialized.com/tutorial/Wikipedia-Content-Scraper-in-PHP/81662 - PHP Web Fetching Wikipedia Content Scraper in PHP Tutorial 
[organic] - http://www.dreamincode.net/forums/topic/9687-programatically-logging-in-and-page-scraping/ - Programatically Logging In And Page Scraping - PHP | Dream.In.Code 
[organic] - http://alexdglover.com/web-scraping-php-and-wheel-of-fortune/ - Alex D Glover Web Scraping, PHP, and Wheel of Fortune - Fun ... 
[organic] - http://itsrj.com/2010/12/24/scraping-sites-using-curl-xpath/ - Scraping Sites Using cURL & XPath | it's rj 
[organic] - http://scraperlab.com/ - ScraperLab | Web Scrapers Generator 
[organic] - http://www.gamegecko.com/game/204/scrape - Scrape - GameGecko.com 
[organic] - http://ask.metafilter.com/98518/Web-scraping-for-dummies - Web scraping for dummies - php mysql programming | Ask MetaFilter 
[organic] - http://kbeezie.com/scraping-google-results/ - Scraping Google Front Page Results » KBeezie 
[organic] - http://forums.thetvdb.com/viewtopic.php?f=4 - TheTVDB.com • View topic - 503 errors using the API / Errors ... 
[organic] - http://forums.devnetwork.net/viewtopic.php?f=1 - screen scraping a site which uses AJAX • PHP Developers Network 
[organic] - http://technoloid.blogspot.com/2012/03/screen-scraping.html - Screen Scraping Tumblr Using Curl | Technoloid 
[organic] - http://themanwhosoldtheweb.com/craigslist-email-scraper.php?tol - Craigslist Email Scraper - TheManWhoSoldtheWeb.com 
[organic] - http://forums.digitizedesign.com/topic/1604-beginner-scraping-script-with-php-and-curl/ - Beginner scraping script with PHP and cURL - PHP - Digitize Design 
[organic] - http://readwrite.com/2012/02/24/data-scraping-comes-of-age-wit - Data Scraping Comes of Age With ScraperWiki.com – ReadWrite 
[organic] - http://www.binaryspark.com/classes/Art-of-the-scrape.pdf - Art of the scrape!!!! - BinarySpark.com 
[organic] - http://rhodesmill.org/brandon/chapters/screen-scraping/ - Chapter 10: Screen Scraping by Brandon Rhodes - Rhodes Mill 
[organic] - http://php.bigresource.com/Scraping-a-Secure-Site-3QvPycau.html - PHP :: Scraping A Secure Site 
[organic] - http://www.nickycakes.com/scraping-websites-for-fun-and-profit-part-2/ - Scraping Websites for Fun and Profit Part 2 | NickyCakes.com 
[organic] - http://books.google.com/books/about/PHP_Architect_s_Guide_to_Web_Scraping.html?id=H6O9cQAACAAJ - PHP-Architect's Guide to Web Scraping - Matthew Turland - Google ... 
[organic] - https://community.x10hosting.com/threads/php-xpath-scraping-data-from-a-page.101059/ - PHP - XPATH - Scraping Data From A Page | x10Hosting Community 
[organic] - http://nicklewis.org/node/962 - Stupid Simple Web Scraping with SimpleXML | Nick Lewis: The Blog 
[organic] - http://www.newthinktank.com/2010/11/python-2-7-tutorial-pt-13-website-scraping/ - Python 2.7 Tutorial Pt 13 Website Scraping - New Think Tank 
[organic] - http://byronwhitlock.com/FastCrawl/ - Whitlock Web Development - Fast Crawl PHP Web crawl framework 
[organic] - http://gablaxian.com/2013/06/18/scraping-twitter-feeds-with-nodejs.html - Scraping Twitter Feeds with NodeJS | gablaxian.com 
[organic] - http://programming.textures-tones.com/2012/01/30/basic-screen-scraping-part-1-basic-xml-parsing/ - Basic Screen Scraping – Part 1, Basic XML Parsing | programming ... 
[organic] - http://www.nmdnet.org/2011/09/01/best-web-host-for-web-scraping-application/ - Best Web host for Web scraping application? » UMaine NMDNet 
[organic] - http://thewebscraping.com/web-scraper-open-source-3/ - Web scraper open source | The Web Scraping 
[organic] - http://skookum.com/blog/scraping-poorly-formatted-data-with-curl-and-phpquery/ - Scraping Poorly Formatted Data with cURL and phpQuery ... 
[organic] - http://www.customwebscraping.com/php-web-scraping - PHP Web Scraping | Andrade Global 
[organic] - http://www.lightspeedretail.com/blog/ - Retail Industry Blog – LightSpeed Retail POS 
[organic] - https://www.distilled.net/blog/seo/building-your-own-scraper-for-link-analysis/ - Building Your Own Scraper for Link Analysis | Distilled 
[organic] - http://datajournalismhandbook.org/1.0/en/getting_data_3.html - Getting Data from the Web - The Data Journalism Handbook 
[organic] - http://jafty.com/blog/scraping-with-curl-using-cookies/ - Scraping with Curl using Cookies | Jafty Interactive Web Development 
[organic] - http://zrashwani.com/simple-web-spider-php-goutte/ - Simple web spider with PHP Goutte | Z.Rashwani Blog 
[organic] - http://blog.redbranch.net/2011/10/28/php-web-scraping-for-munin/ - PHP Web Scraping for Munin » Red Branch 
[organic] - http://answers.google.com/answers/threadview/id/785059.html - Google Answers: Webscraping and WebMacros software 
[organic] - http://www.topprojectshub.com/ - Outsourcing Data Entry, Data Scraping, Document Scanning, PHP ... 
[organic] - http://opensourcebridge.org/sessions/97 - Web Scraping with PHP / Open Source Bridge: The conference for ... 
[organic] - http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6183/pdf/imm6183.pdf - Algorithms for Web Scraping 
[organic] - http://www.armstrong-chemtec.com/rm/index.php?option=com_content - Scraped Surface Crystallizers 
[organic] - https://joind.in/435 - Talk: Web Scraping with PHP - Joind.in 
[organic] - https://thomashunter.name/blog/open-sourcing-my-php-web-scraper/ - Open Sourcing my PHP Web Scraper - Thomas Hunter II 
[organic] - http://ubuntuforums.org/showthread.php?t=1259548 - [other] PHP scraping Help - Ubuntu Forums 
[organic] - http://www.sunilb.com/php/writing-website-scrapers-in-php - Writing Website Scrapers in PHP | Geek Files 
[organic] - http://blog.ericlamb.net/2009/01/a-journey-into-php-cli-and-scraping/ - A journey into php-cli and scraping | Made of Everything You're Not 
[organic] - http://www.troywolf.com/articles/php/class_http/ - PHP class_http from Troy Wolf 
[organic] - http://tutorialzine.com/2013/02/24-cool-php-libraries-you-should-know-about/ - 24 Cool PHP Libraries You Should Know About | Tutorialzine 
[organic] - http://www.logaholic.de/2009/06/01/elegant-oop-html-scraping-with-domdocument/ - Elegant OOP HTML scraping with DOMDocument - Logaholic.de 
[organic] - http://capelinks.net/about/internet/spamdexing/ - Spamdexing: Scrape-O-Rama ~ CapeLinks Internet Services 
[organic] - http://www.proscraper.com/ - Professional Scraper - Website Scraping, Crawling, Data Mining ... 
[organic] - http://www.mrwebmaster.it/php/web-scraping-php_7568.html - Il Web Scraping in PHP | PHP | Mr.Webmaster 
[organic] - http://adamyoung.net/Quickstart-to-PHP-Screen-Scraping - Quickstart to PHP Screen Scraping | Adam Young 
[organic] - http://blog.svnlabs.com/craigslist-scraper-tool/ - Craigslist Scraper Tool | S V N Labs Softwares 
[organic] - http://www.codediesel.com/php/web-scraping-in-php-tutorial/ - Web scraping tutorial - CodeDiesel 
[organic] - http://curl.phptrack.com/forum/viewtopic.php?f=1 - CURL PHP Examples • View topic - Problem scraping url - PHP CURL ... 
[organic] - http://pp19dd.com/2009/11/php-algorithm-for-scraping-and-converting-a-twitter-list-into-rss-format-with-super-fancy-xpath-queries-in-six-awesomely-easy-steps/ - PHP algorithm for scraping and converting a twitter list into RSS ... 
[organic] - http://forum.phux.org/viewtopic.php?f=12 - phux Development • View topic - Data Scraping - MetaCritic.com 
[organic] - http://www.net-security.org/malware_news.php?id=1641 - Malware-driven pervasive memory scraping - Help Net Security 
[organic] - http://www.php-forum.com/phpforum/viewtopic.php?f=2 - www.php-forum.com • View topic - Site Scraping with PHP, HTML ... 
[organic] - http://lamp-dev.com/php-website-scraping-using-chrome-web-driver/635 - PHP Website Scraping using Chrome Web Driver | LAMPDev ... 
[organic] - http://www.ebookgoogle.com/633701-phparchitects-guide-web-scraping-php-repost - php|architect's Guide to Web Scraping with PHP (Repost) - Study ... 
[organic] - http://www.techsupportforum.com/forums/f49/php-screen-scraping-596493.html - PHP screen scraping - Tech Support Forum 
[organic] - http://board.issociate.de/thread/495564/Static-andor-Dynamic-site-scraping-using-PHP.html - Static and/or Dynamic site scraping using PHP 
[organic] - http://www.simplyhired.com/k-scraping-php-jobs.html - Scraping Php Jobs | Job Search with Simply Hired 
[organic] - http://www.script-home.com/php-multithreaded-scraping-of-the-page-implementation-code.html - PHP multithreaded scraping of the page implementation code ... 
[organic] - http://rottentomatoesdatascraping.blogspot.com/2013/05/managing-online-data-by-php-web-scraping.html - Managing Online Data by PHP Web Scraping - Rottentomatoes.com ... 
[organic] - http://www.freelancer.co.uk/projects/PHP-Software-Architecture/web-scraping-php-script.html - web scraping php script | PHP | Software Architecture 
[organic] - http://www.b.shuttle.de/hayek/Hayek/Jochen/wp/blog-en/2011/11/17/book-guide-to-web-scraping-with-php/ - book: Guide to Web Scraping with PHP | Jochen Hayek's Blog in ... 
[organic] - http://www.solveerrors.com/forums/scraping-an-aspx-site-with-php-33513.asp - Scraping an aspx site with php - SolveErrors.com 
[organic] - http://umuwa.com/php-web-scraping-script-download - php web scraping script download - at Umuwa 
[organic] - http://avaxsearch.com/?q=Web%20Scraping%20PHP - Web Scraping PHP - Data on AvaxHome 
[organic] - http://www.filestube.to/p2/php+architect+s+guide+to+web+scraping+with+php - Php architect s guide to web scraping with php download - FilesTube 
[organic] - http://efreedom.net/Question/1-34120/HTML-Scraping-Php - HTML Scraping in Php - efreedom 
[organic] - http://efreedom.net/Question/1-1332590/HTML-Comment-Scraping-PHP - HTML comment scraping in PHP - efreedom 
[organic] - http://www.getacoder.com/projects/view.php?id=144412 - Scraping PHP To Mysql Database (MySQL, PHP, PHP/IIS/MS SQL) 
[organic] - http://www.donanza.com/jobs/p3057980-php_scraping_php_mysql_scraping - Php Scraping - Php Mysql Scraping for Max. $500 - DoNanza 
[organic] - http://www.freelancer.is/projects/PHP-MySQL/Scraping-PHP-cURL-REGEX-Experts.html - Scraping, PHP, cURL, REGEX Experts | Data Mining ... - Freelancer.is 
[organic] - http://www.freelancer.in/job-search/web-scraping-php-simplexml-script/ - web scraping php simplexml script Freelancers and Jobs ... 
[organic] - http://www.freelancer.com.au/projects/PHP-Software-Architecture/Scraping-site-asp-php.html - Scraping site asp - php | PHP | Software Architecture 
[organic] - http://www.freelancer.co.za/projects/Perl/Scraping-site-asp-php-repost.html - Scraping site asp - php - repost | Perl - Freelancer.co.za 
[organic] - http://www.freelancer.com.bd/projects/PHP-Website-Design/PHP-script-for-data-scraping.html - PHP script for data scraping - Freelancer.com.bd 
[organic] - http://www.freelancer.ph/projects/PHP-MySQL/Web-Scraping-PHP-Preferred.html - Web Scraping (PHP Preferred) | Anything Goes | MySQL | PHP ... 
[organic] - http://www.freelancer.pk/projects/PHP-Web-Scraping/web-scraping-bot-submit-form.html - web scraping and bot to submit form iMacros or PHP | Data Mining ... 
[organic] - http://www.freelancer.com.jm/projects/PHP-Software-Architecture/Webpage-scraping-php-mysql-script.html - Webpage scraping php+mysql script - Freelancer.com.jm 
[organic] - http://coding.derkeiler.com/Archive/PHP/php.general/2005-11/msg00154.html - Re: Web Screen Scraping PHP Help 
[organic] - http://www.workingbase.com/project/PHP-login-to-a-website-programatically.2785673.html - PHP login to a website programatically (Javascript, PHP, Web ... 
[organic] - http://www.filestube.com/p/php+architect+s+guide+to+web+scraping - Php architect s guide to web scraping download - FilesTube 
[organic] - http://savedhistory.org/k/web-scraping-ebook-php - Web Scraping Ebook Php - savedwebhistory.org 
[organic] - http://hostcabi.net/websites/web-scraping-php - Web Scraping Php Websites - HostCabi.net 
[organic] - http://books.google.com/books?id=dqI-AQAAMAAJ - The Iron Age - Google Books Result 
[organic] - http://books.google.com/books?id=64I4AQAAMAAJ - The Literary Digest - Google Books Result 
[organic] - http://books.google.com/books?id=P54zAQAAMAAJ - Annual Report of the Pennsylvania Agricultural Experiment Station - Google Books Result 
[organic] - http://alaskagulfcoastexpeditions.com/tf/index.php?hl=lint+traps+for+dryers - Lint traps for dryers - Alaska Gulf Coast Expeditions 
[organic] - http://books.google.com/books?id=7W0-AQAAMAAJ - Harper's New Monthly Magazine - Google Books Result 
[organic] - http://www.trapperman.com/forum/ubbthreads.php/topics/4403841/all/First_Time_Fleshing_Beaver - First Time Fleshing Beaver | Trapper Talk | Trapperman.com Forums 
[organic] - http://books.google.com/books?id=nTYxAQAAMAAJ - Engineering - Google Books Result 
[organic] - http://books.google.com/books?id=pl8vAAAAYAAJ - The country - Google Books Result 
[organic] - http://en.wikipedia.org/wiki/Scrap - Scrap - Wikipedia, the free encyclopedia 
[organic] - http://forum.gamesports.net/dota/showthread.php?84583-Add-metadata-to-website - Add metadata to website 
[organic] - http://forum.the-west.net/showthread.php?p=716823 - The Tiran Wars: Liberty, at all Costs - Page 83 - Forum The West 
[organic] - http://www.horseandhound.co.uk/forums/showthread.php?659234-Following-on-from-the-weaving-thread - Following on from the weaving thread - Horse and Hound 
[organic] - http://forums.digitalspy.co.uk/showthread.php?p=71966376 - Why do people still buy watches? - Page 15 - General Discussion ... 
[organic] - http://washingtondc.craigslist.org/doc/cps/4391028736.html - Database and application development asp.net php - Craigslist 
[organic] - http://forum.bodybuilding.com/index.php - Bodybuilding.com Forums - Bodybuilding And Fitness Board 
[organic] - http://www.disboards.com/showthread.php?p=51078875 - David's DVC rental and MDE?? - The DIS Discussion Forums ... 
[organic] - http://worldoftanks.mmmos.com/?page=view - Side scraping, a good example - World of Tanks - MMMOs 
[organic] - http://www.redpowermagazine.com/forums/index.php?showtopic=85925 - Finally made something out of myself. - Page 2 - Coffee Shop - Red ... 
[organic] - http://www.redpowermagazine.com/forums/index.php?showtopic=85956 - mudslide in Washington state - Page 2 - Coffee Shop - Red Power ... 
[organic] - http://www.dice.com/job/result/10531322/517235?src=19 - PHP Developer - Aqua Systems Inc - Roslyn, NY | dice.com - 3-28 ... 
[organic] - http://forums.winamp.com/showthread.php?p=2988914 - Are skins lost? - Winamp Forums 
[organic] - http://kumb.com/forum/viewtopic.php?f=2 - Knees Up Mother Brown - West Ham United FC Online: Forum • View ... 
[organic] - http://www.wbaunofficial.org.uk/forum/showthread.php?tid=24834 - Fulham and Cardiff gone for me 
[organic] - http://abierta.cl/index.php/abierta-act/areas/itemlist/user/706-joomlayldo - joomlayldo - Comunidad Abierta Arte, Ciencia y Tecnología 
[organic] - http://forums.probetalk.com/showthread.php?s=5365733991fb268c77b6d46da2f40edb - Detailing KLG4. How deep to I go? - ProbeTalk.com Forums 

Requirements:
* PHP 5.2 or higher, PHP libCURL and PHP DOM
* User permissions to write to the local directory (caching)
* us-proxies API support (professional IP provider)
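
The requirements above can be checked before the first run. The following is a minimal sketch, not part of the original scripts: the function name check_requirements() is hypothetical, and it assumes that libCURL and DOM support correspond to the PHP extensions named "curl" and "dom".

```php
<?php
// Hypothetical helper (not part of the original scripts): collects unmet
// requirements for the scraper and returns them as a list of messages.
function check_requirements($cache_dir)
{
    $problems = array();

    // The page states PHP 5.2 or higher.
    if (version_compare(PHP_VERSION, '5.2.0', '<')) {
        $problems[] = 'PHP 5.2 or higher is required, found ' . PHP_VERSION;
    }

    // Assumption: libCURL and DOM support map to the "curl" and "dom" extensions.
    foreach (array('curl', 'dom') as $ext) {
        if (!extension_loaded($ext)) {
            $problems[] = "PHP extension '$ext' is not loaded";
        }
    }

    // The cache directory must exist and be writable by the current user.
    if (!is_dir($cache_dir) || !is_writable($cache_dir)) {
        $problems[] = "Directory '$cache_dir' is missing or not writable";
    }

    return $problems;
}

// "./local_cache" is the default $working_dir of the scraper script.
foreach (check_requirements('./local_cache') as $problem) {
    echo "Requirement not met: $problem\n";
}
```

The scraper itself tries to create its working directory on startup, so a "missing" message for a directory that has not been created yet is only a hint, not necessarily an error.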

Download the source code here: search-engine-scraper.php functions-ses.php simple_html_dom.php

search-engine-scraper.php


#!/usr/bin/php
<?php
    /* License: 
       Open source for private and commercial use but this comment needs to stay untouched on top.
       URL of original source code: http://scraping.compunect.com
       Author of original source code: http://www.compunect.com
       IP rotation API code from here: http://www.us-proxies.com/automate
       Under no circumstances and under no legal theory, whether in tort (including negligence), contract, or otherwise, shall the Licensor be liable to anyone for any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or the use of the Original Work including, without limitation, damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses. This limitation of liability shall not apply to the extent applicable law prohibits such limitation.
       Usage exceptions:
       Public redistributing modifications of this source code project is not allowed without written agreement.
       Using this work for private and commercial projects is allowed, redistributing it is not allowed without our written agreement.
     */

    ini_set("memory_limit","64M"); // For scraping 100 results pages 32MB memory expected, for scraping the default 10 results pages 4MB are expected. 64MB is selected just in case.
    ini_set("xdebug.max_nesting_level","2000"); // precaution, might not be required. our parser will require a deep nesting level but I did not check how deep a 100 result page actually is.
    error_reporting(E_ALL & ~E_NOTICE);
    // ************************* Configuration variables *************************
    // Your api credentials, you need a plan at us-proxies.com
    // It's optional, you can remove the proxy related parts and just use it as a single-IP tool. Just make sure to implement a request delay of around 3-5 minutes in that case.
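    $pwd = "your-key"; // replace with your API key (must be a quoted string)
    $uid = "your-account-id"; // replace with your account ID (must be a quoted string)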

    // General configuration
    $test_website_url = "website.com"; // The URL, or a sub-string of it, of the indexed website.
    $test_keywords = "keyword,another keyword,more keywords"; // comma separated keywords to test the rank for
    $test_max_pages = 3; // The number of result pages to test until giving up per keyword.
    $test_100_resultpage = 0; // Set to 1 to scrape 100-result pages. Warning: Google ranking results may become inaccurate

    /* Local result configuration. Enter 'help' to receive a list of possible choices. Use "global" and "en" for the default worldwide results in English.
     * You need to define a country as well as the language. Visit the Google domain of the specific country to see the available languages.
     * Only a correct combination of country and language will return the correct search engine result pages. */
    $test_country = "global"; // Country code. "global" is the default. Use "help" to receive a list of available codes. [com,us,uk,fr,de,...]
    $test_language = "en"; // Language code. "en" is the default. Use "help" to receive a list. Visit the local Google domain to find the available languages of that domain. [en,fr,de,...]
    $filter = 1; // 0 for no filter (recommended for maximizing content), 1 for normal filter (recommended for accuracy)
    $force_cache = 0; // set this to 1 if you wish to force the loading of cache files, even if the files are older than 24 hours. Set to -1 if you wish to force a new scrape.
    $load_all_ranks = 1; /* set this to 0 if you wish to stop scraping once the $test_website_url has been found in the search engine results,
                         * if set to 1 all $test_max_pages will be downloaded. This might be useful for more detailed ranking analysis.*/

    $show_html = 0; // 1 means: output formatted with HTML tags. 0 means: output for console (recommended script usage)
    $show_all_ranks = 1; // set to 1 to display a complete list of all ranks per keyword, set to 0 to only display the ranks for the specified website
    // ***************************************************************************
    $working_dir = "./local_cache"; // local directory. This script needs permissions to write into it


    require "functions-ses.php";


$page = 0;
$PROXY = array(); // after the rotate api call this variable contains these elements: [address](proxy host),[port](proxy port),[external_ip](the external IP),[ready](0/1)
$PLAN = array();
$results = array();


if ($show_html) $NL = "<br>\n"; else $NL = "\n";
if ($show_html) $HR = "<hr>\n"; else $HR = "_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_\n";
if ($show_html) $B = "<b>"; else $B = "!";
if ($show_html) $B_ = "</b>"; else $B_ = "!";


/*
 * Start of main()
 */

if ($show_html)
{
    echo "<html><body>";
}

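// explode() always returns at least one element, so trim the entries and drop empty ones before counting
$keywords = array_values(array_filter(array_map('trim', explode(",", $test_keywords)), 'strlen'));
if (!count($keywords)) die ("Error: no keywords defined.$NL");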
if (!rmkdir($working_dir)) die("Failed to create/open $working_dir$NL");

$country_data = get_google_cc($test_country, $test_language);
if (!$country_data) die("Invalid country/language code specified.$NL");


$ready = get_license();
if (!$ready) die("The specified API key account for user $uid is not active or invalid. $NL");
if ($PLAN['protocol'] != "http") die("Wrong proxy protocol configured, switch to HTTP. $NL");

echo "$NL$B Search Engine Scraper for $test_website_url initiated $B_ $NL$NL";

/*
 * This loop iterates through all keyword combinations
 */
$ch = NULL;
$rotate_ip = 0; // variable that triggers an IP rotation (normally only during keyword changes)
$max_errors_total = 3; // abort script if there are 3 keywords that can not be scraped (something is going wrong and needs to be checked)

$rank_data = array();
$siterank_data = array();

$break=0; // variable used to cancel loop without losing ranking data
foreach ($keywords as $keyword)
{
    $rank = 0;
    $max_errors_page = 5; // abort script if there are 5 errors in a row, that should not happen

    if ($test_max_pages <= 0) break;
    $search_string = urlencode($keyword);
    $rotate_ip = 1; // IP rotation for each new keyword

    /*
    * This loop iterates through all result pages for the given keyword
    */
    for ($page = 0; $page < $test_max_pages; $page++)
    {
        $serp_data = load_cache($search_string, $page, $country_data, $force_cache); // load results from local cache if available for today
        $maxpages = 0;

        if (!$serp_data)
        {
            $ip_ready = check_ip_usage(); // test if ip has not been used within the critical time
            while (!$ip_ready || $rotate_ip)
            {
                $ok = rotate_proxy(); // start/rotate to the IP that has not been started for the longest time, also tests if proxy connection is working
                if ($ok != 1)
                {
                    die ("Fatal error: proxy rotation failed:$NL $ok$NL");
                }
                $ip_ready = check_ip_usage(); // test if ip has not been used within the critical time
                if (!$ip_ready)
                {
                    die("ERROR: No fresh IPs left, try again later. $NL");
                } else
                {
                    $rotate_ip = 0; // ip rotated
                    break; // continue
                }
            }

            delay_time(); // stop scraping based on the license size to spread scrapes best possible and avoid detection
            global $scrape_result; // contains metainformation from the scrape_google() function
            $raw_data = scrape_google($search_string, $page, $country_data); // scrape html from search engine
            if ($scrape_result != "SCRAPE_SUCCESS")
            {
                if ($max_errors_page--)
                {
                    echo "There was an error scraping (Code: $scrape_result), trying again .. $NL";
                    $page--;
                    continue;
                } else
                {
                    $page--;
                    if ($max_errors_total--)
                    {
                        echo "Too many errors scraping keyword $search_string (at page $page). Skipping remaining pages of keyword $search_string .. $NL";
                        break;
                    } else
                    {
                        die ("ERROR: Max keyword errors reached, something is going wrong. $NL");
                    }
                    break;
                }
            }
            mark_ip_usage(); // store IP usage, this is very important to avoid detection and gray/blacklistings
            global $process_result; // contains metainformation from the process_raw_v2() function
            $serp_data = process_raw_v2($raw_data, $page); // process the html and put results into $serp_data

            if (($process_result == "PROCESS_SUCCESS_MORE") || ($process_result == "PROCESS_SUCCESS_LAST"))
            {
                $result_count = count($serp_data);
                $serp_data['page'] = $page;
                if ($process_result == "PROCESS_SUCCESS_LAST") // flag the page as last only when the parser reports no further pages
                {
                    $serp_data['lastpage'] = 1;
                } else
                {
                    $serp_data['lastpage'] = 0;
                }
                $serp_data['keyword'] = $keyword;
                $serp_data['cc'] = $country_data['cc'];
                $serp_data['lc'] = $country_data['lc'];
                $serp_data['result_count'] = $result_count;
                store_cache($serp_data, $search_string, $page, $country_data); // store results into local cache
            }

            if ($process_result != "PROCESS_SUCCESS_MORE")
            {
                $break=1;
                //break;
            } // last page
            if (!$load_all_ranks)
            {
                for ($n = 0; $n < $result_count; $n++)
                    if (strstr($serp_data[$n]['url'], $test_website_url)) // search the freshly parsed page; $results is never filled
                    {
                        verbose("Located $test_website_url within search results.$NL");
                        $break=1;
                        //break;
                    }
            }

        } // scrape clause

        $result_count = $serp_data['result_count'];

        for ($ref = 0; $ref < $result_count; $ref++)
        {
            $rank++;
            $rank_data[$keyword][$rank]['title'] = $serp_data[$ref]['title'];
            $rank_data[$keyword][$rank]['url']  = $serp_data[$ref]['url'];
            $rank_data[$keyword][$rank]['host'] = $serp_data[$ref]['host'];
            $rank_data[$keyword][$rank]['desc'] = $serp_data[$ref]['desc'];
            $rank_data[$keyword][$rank]['type'] = $serp_data[$ref]['type'];
            //$rank_data[$keyword][$rank]['desc']=$serp_data['desc'']; // not really required
            if (strstr($rank_data[$keyword][$rank]['url'], $test_website_url))
            {
                $info = array();
                $info['rank'] = $rank;
                $info['url'] = $rank_data[$keyword][$rank]['url'];
                $siterank_data[$keyword][] = $info;
            }
        }
        if ($break == 1) break;

    } // page loop
} // keyword loop

if ($show_all_ranks)
{
    foreach ($rank_data as $keyword => $ranks)
    {
        echo "$NL$NL$B" . "Ranking information for keyword \"$keyword\" $B_$NL";
        echo "$B" . "Rank [Type] - Website -  Title$B_$NL";
        $pos = 0;
        foreach ($ranks as $rank)
        {
            $pos++;
            if (strstr($rank['url'], $test_website_url))
            {
                echo "$B$pos [$rank[type]] - $rank[url] - $rank[title] $B_$NL";
//                    echo $rank['desc']."\n";
            } else
            {
                echo "$pos [$rank[type]] - $rank[url] - $rank[title] $NL";
//                    echo $rank['desc']."\n";
            }
        }
    }
}


foreach ($keywords as $keyword)
{
    if (!isset($siterank_data[$keyword]))
    {
        echo "$NL$B" . "The specified site was not found in the search results for keyword \"$keyword\". $B_$NL";
    } else
    {
        $siteranks = $siterank_data[$keyword];
        echo "$NL$NL$B" . "Ranking information for keyword \"$keyword\" and website \"$test_website_url\" [$test_country / $test_language] $B_$NL";
        foreach ($siteranks as $siterank)
            echo "Rank $siterank[rank] for URL $siterank[url]$NL";
    }
}
//var_dump($siterank_data);


if ($show_html)
{
    echo "</body></html>";
}



?>
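
The rank matching done in the page loop of the script above can be sketched in isolation. This is a simplified illustration rather than the script's own code: find_site_ranks() is a hypothetical helper, the sample URLs are invented, and it uses strpos() for the substring test where the script uses strstr().

```php
<?php
// Hypothetical helper (not part of the original scripts): returns every
// position at which $needle occurs in a result URL. Ranks are 1-based and
// follow the order of $results, like the $rank counter in the main script.
function find_site_ranks(array $results, $needle)
{
    $matches = array();
    $rank = 0;
    foreach ($results as $result) {
        $rank++;
        // Plain substring test on the URL, as the script does with strstr().
        if (strpos($result['url'], $needle) !== false) {
            $matches[] = array('rank' => $rank, 'url' => $result['url']);
        }
    }
    return $matches;
}

// Invented sample data for illustration only.
$sample_results = array(
    array('url' => 'http://example.org/a'),
    array('url' => 'http://website.com/page'),
    array('url' => 'http://example.net/b'),
    array('url' => 'http://www.website.com/other'),
);

foreach (find_site_ranks($sample_results, 'website.com') as $hit) {
    echo "Rank {$hit['rank']} for URL {$hit['url']}\n";
}
```

Because the match is a substring test, a short value for $test_website_url also matches sub-domains and any other URL that merely contains the string.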

functions-ses.php


<?PHP
    /* License: 
       Open source for private and commercial use but this comment needs to stay untouched on top.
       URL of original source code: http://scraping.compunect.com
       Author of original source code: http://www.compunect.com
       IP rotation API code from here: http://www.us-proxies.com/automate
       Under no circumstances and under no legal theory, whether in tort (including negligence), contract, or otherwise, shall the Licensor be liable to anyone for any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or the use of the Original Work including, without limitation, damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses. This limitation of liability shall not apply to the extent applicable law prohibits such limitation.
       Usage exceptions:
       Public redistributing modifications of this source code project is not allowed without written agreement.
       Using this work for private and commercial projects is allowed, redistributing it is not allowed without our written agreement.
     */

    function verbose($text)
    {
        echo $text;
    }

    /*
     * By default (no force) the function will load cached data within 24 hours otherwise reject the cache.
     * Google does not change its ranking too frequently, that's why 24 hours has been chosen.
     *
     * Multithreading: When multithreading you need to work on a proper locking mechanism
     */
    function load_cache($search_string, $page, $country_data, $force_cache)
    {
        global $working_dir;
        global $NL;
        global $test_100_resultpage;

        if ($force_cache < 0) return NULL;
        $lc = $country_data['lc'];
        $cc = $country_data['cc'];
        if ($test_100_resultpage)
        {
            $hash = md5($search_string . "_" . $lc . "_" . $cc . "." . $page . ".100p");
        } else
        {
            $hash = md5($search_string . "_" . $lc . "_" . $cc . "." . $page);
        }
        $file = "$working_dir/$hash.cache";
        $now = time();
        if (file_exists($file))
        {
            $ut = filemtime($file);
            $dif = $now - $ut;
            $hour = (int)($dif / (60 * 60));
            if ($force_cache || ($dif < (60 * 60 * 24)))
            {
                $serdata = file_get_contents($file);
                $serp_data = unserialize($serdata);
                verbose("Cache: loaded file $file for $search_string and page $page. File age: $hour hours$NL");

                return $serp_data;
            }

            return NULL;
        } else
        {
            return NULL;
        }

    }

    /*
     * Multithreading: When multithreading you need to work on a proper locking mechanism
     */
    function store_cache($serp_data, $search_string, $page, $country_data)
    {
        global $working_dir;
        global $NL;
        global $test_100_resultpage;

        $lc = $country_data['lc'];
        $cc = $country_data['cc'];
        if ($test_100_resultpage)
        {
            $hash = md5($search_string . "_" . $lc . "_" . $cc . "." . $page . ".100p");
        } else
        {
            $hash = md5($search_string . "_" . $lc . "_" . $cc . "." . $page);
        }
        $file = "$working_dir/$hash.cache";
        $now = time();
        if (file_exists($file))
        {
            $ut = filemtime($file);
            $dif = $now - $ut;
            if ($dif < (60 * 60 * 24)) echo "Warning: cache storage initiated for $search_string page $page which was already cached within the past 24 hours!$NL";
        }
        $serdata = serialize($serp_data);
        file_put_contents($file, $serdata, LOCK_EX);
        verbose("Cache: stored file $file for $search_string and page $page.$NL");
    }
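
    /*
     * Illustrative sketch, not part of the original project: load_cache() and store_cache()
     * build the cache file name with identical code. A shared helper along these lines could
     * keep both in sync. The function name and its parameter list are assumptions made for
     * this example; the key layout is copied from the two functions above.
     */
    function example_cache_file_path($working_dir, $search_string, $page, $country_data, $test_100_resultpage)
    {
        // Same key layout as load_cache()/store_cache(): keyword, language code, country code, page
        $key = $search_string . "_" . $country_data['lc'] . "_" . $country_data['cc'] . "." . $page;
        if ($test_100_resultpage) $key .= ".100p"; // 100-results-per-page mode uses separate cache entries

        return "$working_dir/" . md5($key) . ".cache";
    }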

    // check_ip_usage() must be called before first use of mark_ip_usage()
    function check_ip_usage()
    {
        global $PROXY;
        global $working_dir;
        global $NL;
        global $ip_usage_data; // usage data object as array

        if (!isset($PROXY['ready'])) return 0; // proxy not ready/started
        if (!$PROXY['ready']) return 0; // proxy not ready/started

        if (!isset($ip_usage_data))
        {
            if (!file_exists($working_dir . "/ipdata.obj")) // usage data object as file
            {
                echo "Warning!$NL" . "The ipdata.obj file was not found, if this is the first usage of the rank checker everything is alright.$NL" . "Otherwise removal or failure to access the ip usage data will lead to damage of the IP quality.$NL$NL";
                sleep(5);
                $ip_usage_data = array();
            } else
            {
                $ser_data = file_get_contents($working_dir . "/ipdata.obj");
                $ip_usage_data = unserialize($ser_data);
            }
        }

        if (!isset($ip_usage_data[$PROXY['external_ip']]))
        {
            verbose("IP $PROXY[external_ip] is ready for use $NL");

            return 1; // the IP was not used yet
        }
        if (!isset($ip_usage_data[$PROXY['external_ip']]['requests'][10]['ut_google']))
        {
            verbose("IP $PROXY[external_ip] is ready for use $NL");

            return 1; // the IP has not been used 10+ times yet, return true
        }
        $ut_last = (int)$ip_usage_data[$PROXY['external_ip']]['ut_last-usage']; // last time this IP was used
        $req_total = (int)$ip_usage_data[$PROXY['external_ip']]['request-total']; // total number of requests made by this IP
        $req_20 = (int)$ip_usage_data[$PROXY['external_ip']]['requests'][10]['ut_google']; // unixtime stamp of the 10th most recent request (originally the 20th, lowered to 10 due to Google issues; the variable name is kept)

        $now = time();
        if (($now - $req_20) > (60 * 60))
        {
            verbose("IP $PROXY[external_ip] is ready for use $NL");

            return 1; // more than an hour passed since 20th usage of this IP [changed to 10]
        } else
        {
            $cd_sec = (60 * 60) - ($now - $req_20);
            verbose("IP $PROXY[external_ip] needs $cd_sec seconds cooldown, not ready for use yet $NL");

            return 0; // the IP is overused, it can not be used for scraping without being detected by the search engine yet
        }

    }
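
    /*
     * Illustrative sketch, not part of the original project: the cooldown arithmetic used by
     * check_ip_usage() as a pure function. The function name and the $window parameter are
     * assumptions made for this example. Note one small difference: check_ip_usage() only
     * reports an IP as ready once MORE than an hour has passed, while this helper already
     * returns 0 when exactly $window seconds have elapsed.
     */
    function example_cooldown_seconds($nth_request_ut, $now, $window = 3600)
    {
        $elapsed = $now - $nth_request_ut; // seconds since the n-th most recent request

        return ($elapsed >= $window) ? 0 : $window - $elapsed;
    }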


    // return 1 if license is ready, otherwise 0
    function get_license()
    {
        global $uid;
        global $pwd;
        global $PLAN;
        global $NL;

        $res = ip_service("plan");
        if ($res <= 0)
        {
            verbose("API error: Proxy API connection failed (Error $res). trying again later..$NL$NL");

            return 0;
        }
        $ready = ($PLAN['active'] == 1) ? "active" : "not active";
        verbose("API success: Account is $ready.$NL");

        return ($PLAN['active'] == 1) ? 1 : 0;
    }

    /* Delay (sleep) based on the license size to allow optimal scraping
     *
     * Warning!
     * Do NOT change the delay to be shorter than the specified delay.
     * When scraping Google you should never do more than 20 requests per hour per IP address
     * The recommended value is 10, if you must go higher you can go up to 20 but I'd stay lower
     * This function will create a delay based on your total IP addresses.
     *
     * Together with the IP management functions this will ensure that your IPs stay healthy (no wrong rankings) and undetected (no virus warnings, blacklists, captchas)
     *
     * Multithreading:
     * When multithreading you need to multiply the delay time ($d) by the number of threads
     *
     * Due to Google getting stricter and stricter you might even have to lower the rate.
     */
    function delay_time()
    {
        global $NL;
        global $PLAN;

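        $total_ips = isset($PLAN['total_ips']) ? (float)$PLAN['total_ips'] : 0;
        if ($total_ips <= 0)
        {
            verbose("Delay skipped: the plan reports no IP addresses. $NL");

            return; // avoids a division by zero
        }
        $d = (int)(3600 * 1000000 / ($total_ips * 10)); // microseconds between requests: one hour divided by 10 requests per IP
        verbose("Delay based on plan size.. $NL");
        usleep($d);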
    }
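
    /*
     * Illustrative sketch, not part of the original project: the arithmetic behind delay_time()
     * as a pure function. One hour is divided by the number of requests allowed per hour across
     * all IPs. Example: 50 IPs at 10 requests per IP per hour allow 500 requests per hour, which
     * is 3600 / 500 = 7.2 seconds between requests. The function name and the parameters
     * $requests_per_ip_per_hour and $threads are assumptions made for this example; $threads
     * follows the multithreading note above (multiply the delay by the number of threads).
     */
    function example_delay_microseconds($total_ips, $requests_per_ip_per_hour = 10, $threads = 1)
    {
        if ($total_ips <= 0 || $requests_per_ip_per_hour <= 0) return 0; // nothing sensible to compute
        $per_request = 3600 * 1000000 / ($total_ips * $requests_per_ip_per_hour);

        return (int)($per_request * max(1, $threads));
    }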

    /*
     * Updates and stores the ip usage data object
     * Marks an IP as used and re-sorts the access array 
     */
    function mark_ip_usage()
    {
        global $PROXY;
        global $working_dir;
        global $NL;
        global $ip_usage_data; // usage data object as array

        if (!isset($ip_usage_data)) die("ERROR: Incorrect usage. check_ip_usage() needs to be called once before mark_ip_usage()!$NL");
        $now = time();

        $ip_usage_data[$PROXY['external_ip']]['ut_last-usage'] = $now; // last time this IP was used
        if (!isset($ip_usage_data[$PROXY['external_ip']]['request-total'])) $ip_usage_data[$PROXY['external_ip']]['request-total'] = 0;
        $ip_usage_data[$PROXY['external_ip']]['request-total']++; // total number of requests made by this IP
        // shift fifo queue
        for ($req = 19; $req >= 1; $req--)
        {
            if (isset($ip_usage_data[$PROXY['external_ip']]['requests'][$req]['ut_google']))
            {
                $ip_usage_data[$PROXY['external_ip']]['requests'][$req + 1]['ut_google'] = $ip_usage_data[$PROXY['external_ip']]['requests'][$req]['ut_google'];
            }
        }
        $ip_usage_data[$PROXY['external_ip']]['requests'][1]['ut_google'] = $now;

        $serdata = serialize($ip_usage_data);
        file_put_contents($working_dir . "/ipdata.obj", $serdata, LOCK_EX);

    }
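
    /*
     * Illustrative sketch, not part of the original project: the FIFO shift in mark_ip_usage()
     * expressed with array functions. The newest timestamp is placed at the front and the list
     * is cut to a fixed length. The function name and parameters are assumptions made for this
     * example. Unlike the original structure, which is 1-based and nested under
     * ['requests'][n]['ut_google'], this sketch works on a plain 0-based list of timestamps.
     */
    function example_push_request_time($timestamps, $now, $max_entries = 20)
    {
        array_unshift($timestamps, $now); // newest request first

        return array_slice($timestamps, 0, $max_entries); // drop entries beyond the limit
    }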


    // access google based on parameters and return the raw html, or "" in case of an error (the global $scrape_result holds the status)
    function scrape_google($search_string, $page, $local_data)
    {
        global $ch;
        global $NL;
        global $PROXY;
        global $PLAN;
        global $scrape_result;
        global $test_100_resultpage;
        global $filter;
        $scrape_result = "";

        $google_ip = $local_data['domain'];
        $hl = $local_data['lc'];

        if ($page == 0)
        {
            if ($test_100_resultpage)
            {
                $url = "http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&num=100&filter=$filter";
            } else
            {
                $url = "http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&num=10&filter=$filter";
            }
        } else
        {

            if ($test_100_resultpage)
            {
                $num = $page * 100;
                $url = "http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&start=$num&num=100&filter=$filter";
            } else
            {
                $num = $page * 10;
                $url = "http://$google_ip/search?q=$search_string&hl=$hl&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&start=$num&num=10&filter=$filter";
            }
        }
        //verbose("Debug, Search URL: $url$NL");

        curl_setopt($ch, CURLOPT_URL, $url);
        $htmdata = curl_exec($ch);
        if (!$htmdata)
        {
            $error = curl_error($ch);
            $info = curl_getinfo($ch);
            echo "\tError scraping: $error [HTTP code: $info[http_code]]$NL";
            $scrape_result = "SCRAPE_ERROR";
            sleep(3);

            return "";
        } else
        {
            if (strlen($htmdata) < 20)
            {
                $scrape_result = "SCRAPE_EMPTY_SERP";
                sleep(3);

                return "";
            }
        }


        // Strings that appear on Google's block and captcha pages
        $block_markers = array(
            "computer virus or spyware application",
            "entire network is affected",
            "http://www.download.com/Antivirus",
            "/images/yellow_warning.gif",
            "This page appears when Google automatically detects requests coming from your computer network",
        );
        foreach ($block_markers as $marker)
        {
            if (strstr($htmdata, $marker))
            {
                echo("Google blocked us, we need more proxies ! Make sure you did not damage the IP management functions. Consider changing keywords and lowering request rates. $NL");
                $scrape_result = "SCRAPE_DETECTED";
                die();
            }
        }
        $scrape_result = "SCRAPE_SUCCESS";

        return $htmdata;
    }

    require_once "simple_html_dom.php";
    function process_raw_v2($data, $page)
    {
        global $process_result; // contains metainformation from the process_raw() function
        global $test_100_resultpage;
        global $NL;
        global $B;
        global $B_;
        $results=array();

        $html = new simple_html_dom();
        $html->load($data);
        /** @var $interest simple_html_dom_node */
        $interest = $html->find('div#ires ol div.g');
        echo "found interesting elements: ".count($interest)."\n";
        $interest_num=0;
        foreach ($interest as $li)
        {
            $result = array('title'=>'undefined','host'=>'undefined','url'=>'undefined','desc'=>'undefined','type'=>'organic');
            $interest_num ++;
            $h3 = $li->find('h3.r',0);
            if (!$h3)
            {
                continue;
            }
            $a = $h3->find('a',0);
            if (!$a) continue;
            $result['title'] = html_entity_decode($a->plaintext);
            $lnk = urldecode($a->href);
            if ($lnk)
            {
                preg_match('/(ht[^&]*)/', $lnk, $m);
                if ($m && $m[1])
                {
                    $result['url']=$m[1];
                    $tmp=parse_url($m[1]);
                    if (is_array($tmp) && isset($tmp['host'])) $result['host']=$tmp['host']; // parse_url() can fail or omit the host
                } else
                {
                    if (strstr($result['title'],'News')) $result['type']='news';
                    if (strstr($result['title'],'Images')) $result['type']='images';
                }
            }
            if ($result['type']=='organic')
            {
                $sp = $li->find('span.st',0);
                if ($sp)
                {
                    $result['desc']=html_entity_decode($sp->plaintext);
                    $sp->clear();
                }
            }
            $h3->clear();
            $a->clear();
            $li->clear();
            $results[]=$result;
        }
        $html->clear(); // release the DOM; the original "$html->clear;" was a property access and did nothing





        // Analyze if more results are available (next page)
        $next = 0;
        if (strstr($data, "Next</a>"))
        {
            $next = 1;
        } else
        {
            if ($test_100_resultpage)
            {
                $needstart = ($page + 1) * 100;
            } else
            {
                $needstart = ($page + 1) * 10;
            }
            $findstr = "start=$needstart";
            if (strstr($data, $findstr)) $next = 1;
        }
        $page++;
        if ($next)
        {
            $process_result = "PROCESS_SUCCESS_MORE"; // more data available
        } else
        {
            $process_result = "PROCESS_SUCCESS_LAST";
        } // last page reached

        return $results;
    }

    function rotate_proxy()
    {
        global $PROXY;
        global $ch;
        global $NL;
        $max_errors = 3;
        $success = 0;
        while ($max_errors--)
        {
            $res = ip_service("rotate"); // will fill $PROXY
            $ip = "";
            if ($res <= 0)
            {
                verbose("API error: Proxy API connection failed (Error $res). trying again soon..$NL$NL");
                sleep(21); // retry after a while
            } else
            {
                verbose("API success: Received proxy IP $PROXY[external_ip] on port $PROXY[port]$NL");
                $success = 1;
                break;
            }
        }
        if ($success)
        {
            $ch = new_curl_session($ch);

            return 1;
        } else
        {
            return "API rotation failed. Check license, firewall and API credentials.$NL";
        }
    }


    function extractBody($response_str)
    {
        $parts = preg_split('|(?:\r?\n){2}|m', $response_str, 2);
        if (isset($parts[1])) return $parts[1];

        return '';
    }

    /*
     * This is the API function to retrieve US IP addresses
     * On success this function will define the global $PROXY variable, adding the elements ready,address,port,external_ip and return 1
     * On failure the return is 0 or smaller and the PROXY variable ready element is set to "0"
     * To obtain a plan please check out us-proxies.com, this can often be handled within a day
     */

    function ip_service($cmd, $x = "")
    {
        global $pwd;
        global $uid;
        global $PROXY;
        global $PLAN;
        global $NL;

        $fp = fsockopen("us-proxies.com", 80);
        if (!$fp)
        {
            echo "Unable to connect to API $NL";

            return -1; // connection not possible
        } else
        {
            if ($cmd == "plan")
            {
                fwrite($fp, "GET /api.php?api=1&uid=$uid&pwd=$pwd&cmd=plan&extended=1 HTTP/1.0\r\nHost: us-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");

                stream_set_timeout($fp, 8);
                $res = "";
                $n = 0;
                while (!feof($fp))
                {
                    if ($n++ > 4) break;
                    $res .= fread($fp, 8192);
                }
                $info = stream_get_meta_data($fp);
                fclose($fp);

                if ($info['timed_out'])
                {
                    echo "API: Connection timed out! $NL";
                    $PLAN['active'] = 0;

                    return -2; // api timeout
                } else
                {
                    if (strlen($res) > 1000) return -3; // invalid api response (check the API website for possible problems)
                    $data = extractBody($res);
                    $ar = explode(":", $data);
                    if (count($ar) < 4) return -100; // invalid api response
                    switch ($ar[0])
                    {
                        case "ERROR":
                            echo "API Error: $res $NL";
                            $PLAN['active'] = 0;

                            return 0; // Error received
                            break;
                        case "PLAN":
                            $PLAN['max_ips'] = $ar[1]; // number of IPs licensed
                            $PLAN['total_ips'] = $ar[2]; // number of IPs assigned
                            $PLAN['protocol'] = $ar[3]; // current proxy protocol (http, socks, ..)
                            $PLAN['processes'] = $ar[4]; // number of available proxy processes
                            if ($PLAN['total_ips'] > 0) $PLAN['active'] = 1; else $PLAN['active'] = 0;

                            return 1;
                            break;
                        default:
                            echo "API Error: Received answer $ar[0], expected \"PLAN\"";
                            $PLAN['active'] = 0;

                            return -101; // unknown API response
                    }
                }

            } // cmd==plan


            if ($cmd == "rotate")
            {
                $PROXY['ready'] = 0;
                fwrite($fp, "GET /api.php?api=1&uid=$uid&pwd=$pwd&cmd=rotate&randomness=0&offset=0 HTTP/1.0\r\nHost: us-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");
                stream_set_timeout($fp, 8);
                $res = "";
                $n = 0;
                while (!feof($fp))
                {
                    if ($n++ > 4) break;
                    $res .= fread($fp, 8192);
                }
                $info = stream_get_meta_data($fp);
                fclose($fp);

                if ($info['timed_out'])
                {
                    echo "API: Connection timed out! $NL";

                    return -2; // api timeout
                } else
                {
                    if (strlen($res) > 1000) return -3; // invalid api response (check the API website for possible problems)
                    $data = extractBody($res);
                    $ar = explode(":", $data);
                    if (count($ar) < 4) return -100; // invalid api response
                    switch ($ar[0])
                    {
                        case "ERROR":
                            echo "API Error: $res $NL";

                            return 0; // Error received
                            break;
                        case "ROTATE":
                            $PROXY['address'] = $ar[1];
                            $PROXY['port'] = $ar[2];
                            $PROXY['external_ip'] = $ar[3];
                            $PROXY['ready'] = 1;
                            usleep(230000); // additional time to avoid connecting during proxy bootup phase, removing this can cause random connection failures but will increase overall performance for large IP licenses
                            return 1;
                            break;
                        default:
                            echo "API Error: Received answer $ar[0], expected \"ROTATE\"";

                            return -101; // unknown API response
                    }
                }
            } // cmd==rotate
        }
    }




    function getip()
    {
        global $PROXY;
        if (!$PROXY['ready']) return -1; // proxy not ready

        $curl_handle = curl_init();
        curl_setopt($curl_handle, CURLOPT_URL, 'http://ipcheck.ipnetic.com/remote_ip.php'); // returns the real IP
        curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 10);
        curl_setopt($curl_handle, CURLOPT_TIMEOUT, 10);
        curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
        $curl_proxy = "$PROXY[address]:$PROXY[port]";
        curl_setopt($curl_handle, CURLOPT_PROXY, $curl_proxy);
        $tested_ip = curl_exec($curl_handle);

        // Note: the pattern is not anchored, so any response that contains an IPv4 address is accepted
        if (preg_match('/([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}/', $tested_ip))
        {
            curl_close($curl_handle);

            return $tested_ip;
        } else
        {
            $info = curl_getinfo($curl_handle);
            curl_close($curl_handle);

            return 0; // possible error would be a wrong authentication IP or a firewall
        }
    }


    function new_curl_session($ch = NULL)
    {
        global $PROXY;
        if ((!isset($PROXY['ready'])) || (!$PROXY['ready'])) return $ch; // proxy not ready

        if (isset($ch) && ($ch != NULL))
        {
            curl_close($ch);
        }
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $curl_proxy = "$PROXY[address]:$PROXY[port]";
        curl_setopt($ch, CURLOPT_PROXY, $curl_proxy);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
        curl_setopt($ch, CURLOPT_TIMEOUT, 20);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.0; en; rv:1.9.0.4) Gecko/2009011913 Firefox/3.0.6");



        return $ch;
    }


    function rmkdir($path, $mode = 0755)
    {
        if (file_exists($path)) return 1;

        return @mkdir($path, $mode, true); // third argument creates missing parent directories as well
    }


    /*
     * For country&language specific searches
     * The identifier codes require an active plan at us-proxies.com
     * If you plan to omit the IP service just replace that part too or do not use language specifications at all
     */
    function get_google_cc($cc, $lc)
    {
        global $pwd;
        global $uid;
        global $PROXY;
        global $PLAN;
        global $NL;
        $fp = fsockopen("us-proxies.com", 80);
        if (!$fp)
        {
            echo "Unable to connect to google_cc API of us-proxies.com $NL";

            return NULL; // connection not possible
        } else
        {
//            echo("GET /g_api.php?api=1&uid=$uid&pwd=$pwd&cmd=google_cc&cc=$cc&lc=$lc HTTP/1.0\r\nHost: us-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");
            fwrite($fp, "GET /g_api.php?api=1&uid=$uid&pwd=$pwd&cmd=google_cc&cc=$cc&lc=$lc HTTP/1.0\r\nHost: us-proxies.com\r\nAccept: text/html, text/plain, text/*, */*;q=0.01\r\nAccept-Encoding: plain\r\nAccept-Language: en\r\n\r\n");
            stream_set_timeout($fp, 8);
            $res = "";
            $n = 0;
            while (!feof($fp))
            {
                if ($n++ > 4) break;
                $res .= fread($fp, 8192);
            }
            $info = stream_get_meta_data($fp);
            fclose($fp);

            if ($info['timed_out'])
            {
                echo "API: Connection timed out! $NL";

                return NULL; // api timeout
            } else
            {
                $data = extractBody($res);
                if (strlen($data) < 4) return NULL; // invalid api response
                $obj = unserialize($data);
                if (!is_array($obj)) return NULL; // response could not be decoded
                if (isset($obj['error'])) echo $obj['error'] . "$NL";
                if (isset($obj['info'])) echo $obj['info'] . "$NL";

                return isset($obj['data']) ? $obj['data'] : NULL;
            }
        }
    }


?>

simple_html_dom.php


<?php

/**

 * Website: http://sourceforge.net/projects/simplehtmldom/

 * Acknowledge: Jose Solorzano (https://sourceforge.net/projects/php-html/)

 * Contributions by:

 *     Yousuke Kumakura (Attribute filters)

 *     Vadim Voituk (Negative indexes supports of "find" method)

 *     Antcs (Constructor with automatically load contents either text or file/url)

 *

 * all affected sections have comments starting with "PaperG"

 *

 * Paperg - Added case insensitive testing of the value of the selector.

 * Paperg - Added tag_start for the starting index of tags - NOTE: This works but not accurately.

 *  This tag_start gets counted AFTER \r\n have been crushed out, and after the remove_noice calls so it will not reflect the REAL position of the tag in the source,

 *  it will almost always be smaller by some amount.

 *  We use this to determine how far into the file the tag in question is.  This "percentage" will never be accurate as the $dom->size is the "real" number of bytes the dom was created from.

 *  but for most purposes, it's a really good estimation.

 * Paperg - Added the forceTagsClosed to the dom constructor.  Forcing tags closed is great for malformed html, but it CAN lead to parsing errors.

 * Allow the user to tell us how much they trust the html.

 * Paperg add the text and plaintext to the selectors for the find syntax.  plaintext implies text in the innertext of a node.  text implies that the tag is a text node.

 * This allows for us to find tags based on the text they contain.

 * Create find_ancestor_tag to see if a tag is - at any level - inside of another specific tag.

 * Paperg: added parse_charset so that we know about the character set of the source document.

 *  NOTE:  If the user's system has a routine called get_last_retrieve_url_contents_content_type available, we will assume it's returning the content-type header from the

 *  last transfer or curl_exec, and we will parse that and use it in preference to any other method of charset detection.

 *

 * Found infinite loop in the case of broken html in restore_noise.  Rewrote to protect from that.

 * PaperG (John Schlick) Added get_display_size for "IMG" tags.

 *

 * Licensed under The MIT License

 * Redistributions of files must retain the above copyright notice.

 *

 * @author S.C. Chen <me578022@gmail.com>

 * @author John Schlick

 * @author Rus Carroll

 * @version 1.5 ($Rev: 196 $)

 * @package PlaceLocalInclude

 * @subpackage simple_html_dom

 */



/**

 * All of the Defines for the classes below.

 * @author S.C. Chen <me578022@gmail.com>

 */

define('HDOM_TYPE_ELEMENT', 1);

define('HDOM_TYPE_COMMENT', 2);

define('HDOM_TYPE_TEXT',    3);

define('HDOM_TYPE_ENDTAG',  4);

define('HDOM_TYPE_ROOT',    5);

define('HDOM_TYPE_UNKNOWN', 6);

define('HDOM_QUOTE_DOUBLE', 0);

define('HDOM_QUOTE_SINGLE', 1);

define('HDOM_QUOTE_NO',     3);

define('HDOM_INFO_BEGIN',   0);

define('HDOM_INFO_END',     1);

define('HDOM_INFO_QUOTE',   2);

define('HDOM_INFO_SPACE',   3);

define('HDOM_INFO_TEXT',    4);

define('HDOM_INFO_INNER',   5);

define('HDOM_INFO_OUTER',   6);

define('HDOM_INFO_ENDSPACE',7);

define('DEFAULT_TARGET_CHARSET', 'UTF-8');

define('DEFAULT_BR_TEXT', "\r\n");

define('DEFAULT_SPAN_TEXT', " ");

define('MAX_FILE_SIZE', 600000);

// helper functions

// -----------------------------------------------------------------------------

// get html dom from file

// $maxlen is defined in the code as PHP_STREAM_COPY_ALL which is defined as -1.

function file_get_html($url, $use_include_path = false, $context=null, $offset = 0, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)

{

    // We DO force the tags to be terminated.

    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);

    // For sourceforge users: uncomment the next line and comment the retrieve_url_contents line 2 lines down if it is not already done.

    $contents = file_get_contents($url, $use_include_path, $context, $offset);

    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.

    //$contents = retrieve_url_contents($url);

    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)

    {

        return false;

    }

    // The second parameter can force the selectors to all be lowercase.

    $dom->load($contents, $lowercase, $stripRN);

    return $dom;

}



// get html dom from string

function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)

{

    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);

    if (empty($str) || strlen($str) > MAX_FILE_SIZE)

    {

        $dom->clear();

        return false;

    }

    $dom->load($str, $lowercase, $stripRN);

    return $dom;

}



// dump html dom tree

function dump_html_tree($node, $show_attr=true, $deep=0)

{

    $node->dump($show_attr, $deep);

}





/**

 * simple html dom node

 * PaperG - added ability for "find" routine to lowercase the value of the selector.

 * PaperG - added $tag_start to track the start position of the tag in the total byte index

 *

 * @package PlaceLocalInclude

 */

class simple_html_dom_node

{

    public $nodetype = HDOM_TYPE_TEXT;

    public $tag = 'text';

    public $attr = array();

    public $children = array();

    public $nodes = array();

    public $parent = null;

    // The "info" array - see HDOM_INFO_... for what each element contains.

    public $_ = array();

    public $tag_start = 0;

    private $dom = null;



    function __construct($dom)

    {

        $this->dom = $dom;

        $dom->nodes[] = $this;

    }



    function __destruct()

    {

        $this->clear();

    }



    function __toString()

    {

        return $this->outertext();

    }



    // clean up memory due to php5 circular references memory leak...

    function clear()

    {

        $this->dom = null;

        $this->nodes = null;

        $this->parent = null;

        $this->children = null;

    }



    // dump node's tree

    function dump($show_attr=true, $deep=0)

    {

        $lead = str_repeat('    ', $deep);



        echo $lead.$this->tag;

        if ($show_attr && count($this->attr)>0)

        {

            echo '(';

            foreach ($this->attr as $k=>$v)

                echo "[$k]=>\"".$this->$k.'", ';

            echo ')';

        }

        echo "\n";



        if ($this->nodes)

        {

            foreach ($this->nodes as $c)

            {

                $c->dump($show_attr, $deep+1);

            }

        }

    }





    // Debugging function to dump a single dom node with a bunch of information about it.

    function dump_node($echo=true)

    {



        $string = $this->tag;

        if (count($this->attr)>0)

        {

            $string .= '(';

            foreach ($this->attr as $k=>$v)

            {

                $string .= "[$k]=>\"".$this->$k.'", ';

            }

            $string .= ')';

        }

        if (count($this->_)>0)

        {

            $string .= ' $_ (';

            foreach ($this->_ as $k=>$v)

            {

                if (is_array($v))

                {

                    $string .= "[$k]=>(";

                    foreach ($v as $k2=>$v2)

                    {

                        $string .= "[$k2]=>\"".$v2.'", ';

                    }

                    $string .= ")";

                } else {

                    $string .= "[$k]=>\"".$v.'", ';

                }

            }

            $string .= ")";

        }



        if (isset($this->text))

        {

            $string .= " text: (" . $this->text . ")";

        }



        $string .= " HDOM_INNER_INFO: '";

        if (isset($this->_[HDOM_INFO_INNER]))

        {

            $string .= $this->_[HDOM_INFO_INNER] . "'";

        }

        else

        {

            $string .= ' NULL ';

        }



        $string .= " children: " . count($this->children);

        $string .= " nodes: " . count($this->nodes);

        $string .= " tag_start: " . $this->tag_start;

        $string .= "\n";



        if ($echo)

        {

            echo $string;

            return;

        }

        else

        {

            return $string;

        }

    }



    // returns the parent of node

    // If a node is passed in, it will reset the parent of the current node to that one.

    function parent($parent=null)

    {

        // I am SURE that this doesn't work properly.

        // It fails to remove the current node from its current parent's nodes or children list first.

        if ($parent !== null)

        {

            $this->parent = $parent;

            $this->parent->nodes[] = $this;

            $this->parent->children[] = $this;

        }



        return $this->parent;

    }



    // verify that node has children

    function has_child()

    {

        return !empty($this->children);

    }



    // returns children of node

    function children($idx=-1)

    {

        if ($idx===-1)

        {

            return $this->children;

        }

        if (isset($this->children[$idx])) return $this->children[$idx];

        return null;

    }



    // returns the first child of node

    function first_child()

    {

        if (count($this->children)>0)

        {

            return $this->children[0];

        }

        return null;

    }



    // returns the last child of node

    function last_child()

    {

        if (($count=count($this->children))>0)

        {

            return $this->children[$count-1];

        }

        return null;

    }



    // returns the next sibling of node

    function next_sibling()

    {

        if ($this->parent===null)

        {

            return null;

        }



        $idx = 0;

        $count = count($this->parent->children);

        while ($idx<$count && $this!==$this->parent->children[$idx])

        {

            ++$idx;

        }

        if (++$idx>=$count)

        {

            return null;

        }

        return $this->parent->children[$idx];

    }



    // returns the previous sibling of node

    function prev_sibling()

    {

        if ($this->parent===null) return null;

        $idx = 0;

        $count = count($this->parent->children);

        while ($idx<$count && $this!==$this->parent->children[$idx])

            ++$idx;

        if (--$idx<0) return null;

        return $this->parent->children[$idx];

    }



    // function to locate a specific ancestor tag in the path to the root.

    function find_ancestor_tag($tag)

    {

        global $debugObject;

        if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }



        // Start by including ourselves in the comparison.

        $returnDom = $this;



        while (!is_null($returnDom))

        {

            if (is_object($debugObject)) { $debugObject->debugLog(2, "Current tag is: " . $returnDom->tag); }



            if ($returnDom->tag == $tag)

            {

                break;

            }

            $returnDom = $returnDom->parent;

        }

        return $returnDom;

    }



    // get dom node's inner html

    function innertext()

    {

        if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER];

        if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);



        $ret = '';

        foreach ($this->nodes as $n)

            $ret .= $n->outertext();

        return $ret;

    }



    // get dom node's outer text (with tag)

    function outertext()

    {

        global $debugObject;

        if (is_object($debugObject))

        {

            $text = '';

            if ($this->tag == 'text')

            {

                if (!empty($this->text))

                {

                    $text = " with text: " . $this->text;

                }

            }

            $debugObject->debugLog(1, 'Outertext of tag: ' . $this->tag . $text);

        }



        if ($this->tag==='root') return $this->innertext();



        // trigger callback

        if ($this->dom && $this->dom->callback!==null)

        {

            call_user_func_array($this->dom->callback, array($this));

        }



        if (isset($this->_[HDOM_INFO_OUTER])) return $this->_[HDOM_INFO_OUTER];

        if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);



        // render begin tag

        if ($this->dom && $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]])

        {

            $ret = $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]->makeup();

        } else {

            $ret = "";

        }



        // render inner text

        if (isset($this->_[HDOM_INFO_INNER]))

        {

            // If it's a br tag, don't return the HDOM_INFO_INNER that we may or may not have added.

            if ($this->tag != "br")

            {

                $ret .= $this->_[HDOM_INFO_INNER];

            }

        } else {

            if ($this->nodes)

            {

                foreach ($this->nodes as $n)

                {

                    $ret .= $this->convert_text($n->outertext());

                }

            }

        }



        // render end tag

        if (isset($this->_[HDOM_INFO_END]) && $this->_[HDOM_INFO_END]!=0)

            $ret .= '</'.$this->tag.'>';

        return $ret;

    }



    // get dom node's plain text

    function text()

    {

        if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER];

        switch ($this->nodetype)

        {

            case HDOM_TYPE_TEXT: return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);

            case HDOM_TYPE_COMMENT: return '';

            case HDOM_TYPE_UNKNOWN: return '';

        }

        if (strcasecmp($this->tag, 'script')===0) return '';

        if (strcasecmp($this->tag, 'style')===0) return '';



        $ret = '';

        // In rare cases, (always node type 1 or HDOM_TYPE_ELEMENT - observed for some span tags, and some p tags) $this->nodes is set to NULL.

        // NOTE: This indicates that there is a problem where it's set to NULL without a clear happening.

        // WHY is this happening?

        if (!is_null($this->nodes))

        {

            foreach ($this->nodes as $n)

            {

                $ret .= $this->convert_text($n->text());

            }



            // If this node is a span... add a space at the end of it so multiple spans don't run into each other.  This is plaintext after all.

            if ($this->tag == "span")

            {

                $ret .= $this->dom->default_span_text;

            }





        }

        return $ret;

    }



    function xmltext()

    {

        $ret = $this->innertext();

        $ret = str_ireplace('<![CDATA[', '', $ret);

        $ret = str_replace(']]>', '', $ret);

        return $ret;

    }



    // build node's text with tag

    function makeup()

    {

        // text, comment, unknown

        if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);



        $ret = '<'.$this->tag;

        $i = -1;



        foreach ($this->attr as $key=>$val)

        {

            ++$i;



            // skip removed attribute

            if ($val===null || $val===false)

                continue;



            $ret .= $this->_[HDOM_INFO_SPACE][$i][0];

            //no value attr: nowrap, checked selected...

            if ($val===true)

                $ret .= $key;

            else {

                switch ($this->_[HDOM_INFO_QUOTE][$i])

                {

                    case HDOM_QUOTE_DOUBLE: $quote = '"'; break;

                    case HDOM_QUOTE_SINGLE: $quote = '\''; break;

                    default: $quote = '';

                }

                $ret .= $key.$this->_[HDOM_INFO_SPACE][$i][1].'='.$this->_[HDOM_INFO_SPACE][$i][2].$quote.$val.$quote;

            }

        }

        $ret = $this->dom->restore_noise($ret);

        return $ret . $this->_[HDOM_INFO_ENDSPACE] . '>';

    }



    // find elements by css selector

    //PaperG - added ability for find to lowercase the value of the selector.

    function find($selector, $idx=null, $lowercase=false)

    {

        $selectors = $this->parse_selector($selector);

        if (($count=count($selectors))===0) return array();

        $found_keys = array();



        // find each selector

        for ($c=0; $c<$count; ++$c)

        {

            // The change on the below line was documented on the sourceforge code tracker id 2788009

            // used to be: if (($levle=count($selectors[0]))===0) return array();

            if (($levle=count($selectors[$c]))===0) return array();

            if (!isset($this->_[HDOM_INFO_BEGIN])) return array();



            $head = array($this->_[HDOM_INFO_BEGIN]=>1);



            // handle descendant selectors iteratively (no recursion)

            for ($l=0; $l<$levle; ++$l)

            {

                $ret = array();

                foreach ($head as $k=>$v)

                {

                    $n = ($k===-1) ? $this->dom->root : $this->dom->nodes[$k];

                    //PaperG - Pass this optional parameter on to the seek function.

                    $n->seek($selectors[$c][$l], $ret, $lowercase);

                }

                $head = $ret;

            }



            foreach ($head as $k=>$v)

            {

                if (!isset($found_keys[$k]))

                    $found_keys[$k] = 1;

            }

        }



        // sort keys

        ksort($found_keys);



        $found = array();

        foreach ($found_keys as $k=>$v)

            $found[] = $this->dom->nodes[$k];



        // return nth-element or array

        if (is_null($idx)) return $found;

        else if ($idx<0) $idx = count($found) + $idx;

        return (isset($found[$idx])) ? $found[$idx] : null;

    }
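
    /*
     * Usage sketch for find() (illustrative only; the markup and variable
     * names below are assumptions, not part of this project):
     *
     *     $dom = new simple_html_dom();
     *     $dom->load('<div id="main"><a class="ext big" href="http://example.com/">x</a></div>');
     *
     *     $all   = $dom->find('a');             // no index: array of matches
     *     $first = $dom->find('div#main a', 0); // index 0: first match or null
     *     $last  = $dom->find('a', -1);         // negative index counts from the end
     *     $ext   = $dom->find('a.ext');         // matches one class out of several
     *     $any   = $dom->find('a, div');        // comma-separated selectors are merged
     *
     * Passing true as the third argument lowercases both sides before the
     * attribute value comparison.
     */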



    // seek for given conditions

    // PaperG - added parameter to allow for case insensitive testing of the value of a selector.

    protected function seek($selector, &$ret, $lowercase=false)

    {

        global $debugObject;

        if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }



        list($tag, $key, $val, $exp, $no_key) = $selector;



        // xpath index

        if ($tag && $key && is_numeric($key))

        {

            $count = 0;

            foreach ($this->children as $c)

            {

                if ($tag==='*' || $tag===$c->tag) {

                    if (++$count==$key) {

                        $ret[$c->_[HDOM_INFO_BEGIN]] = 1;

                        return;

                    }

                }

            }

            return;

        }



        $end = (!empty($this->_[HDOM_INFO_END])) ? $this->_[HDOM_INFO_END] : 0;

        if ($end==0) {

            $parent = $this->parent;

            while (!isset($parent->_[HDOM_INFO_END]) && $parent!==null) {

                $end -= 1;

                $parent = $parent->parent;

            }

            $end += $parent->_[HDOM_INFO_END];

        }



        for ($i=$this->_[HDOM_INFO_BEGIN]+1; $i<$end; ++$i) {

            $node = $this->dom->nodes[$i];



            $pass = true;



            if ($tag==='*' && !$key) {

                if (in_array($node, $this->children, true))

                    $ret[$i] = 1;

                continue;

            }



            // compare tag

            if ($tag && $tag!=$node->tag && $tag!=='*') {$pass=false;}

            // compare key

            if ($pass && $key) {

                if ($no_key) {

                    if (isset($node->attr[$key])) $pass=false;

                } else {

                    if (($key != "plaintext") && !isset($node->attr[$key])) $pass=false;

                }

            }

            // compare value

            if ($pass && $key && $val  && $val!=='*') {

                // If they have told us that this is a "plaintext" search then we want the plaintext of the node - right?

                if ($key == "plaintext") {

                    // $node->plaintext actually returns $node->text();

                    $nodeKeyValue = $node->text();

                } else {

                    // this is a normal search, we want the value of that attribute of the tag.

                    $nodeKeyValue = $node->attr[$key];

                }

                if (is_object($debugObject)) {$debugObject->debugLog(2, "testing node: " . $node->tag . " for attribute: " . $key . $exp . $val . " where nodes value is: " . $nodeKeyValue);}



                //PaperG - If lowercase is set, do a case insensitive test of the value of the selector.

                if ($lowercase) {

                    $check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue));

                } else {

                    $check = $this->match($exp, $val, $nodeKeyValue);

                }

                if (is_object($debugObject)) {$debugObject->debugLog(2, "after match: " . ($check ? "true" : "false"));}



                // handle multiple class

                if (!$check && strcasecmp($key, 'class')===0) {

                    foreach (explode(' ',$node->attr[$key]) as $k) {

                        // Without this, there were cases where leading, trailing, or double spaces lead to our comparing blanks - bad form.

                        if (!empty($k)) {

                            if ($lowercase) {

                                $check = $this->match($exp, strtolower($val), strtolower($k));

                            } else {

                                $check = $this->match($exp, $val, $k);

                            }

                            if ($check) break;

                        }

                    }

                }

                if (!$check) $pass = false;

            }

            if ($pass) $ret[$i] = 1;

            unset($node);

        }

        // It's passed by reference so this is actually what this function returns.

        if (is_object($debugObject)) {$debugObject->debugLog(1, "EXIT - ret: ", $ret);}

    }



    protected function match($exp, $pattern, $value) {

        global $debugObject;

        if (is_object($debugObject)) {$debugObject->debugLogEntry(1);}



        switch ($exp) {

            case '=':

                return ($value===$pattern);

            case '!=':

                return ($value!==$pattern);

            case '^=':

                return preg_match("/^".preg_quote($pattern,'/')."/", $value);

            case '$=':

                return preg_match("/".preg_quote($pattern,'/')."$/", $value);

            case '*=':

                if ($pattern[0]=='/') {

                    return preg_match($pattern, $value);

                }

                return preg_match("/".$pattern."/i", $value);

        }

        return false;

    }
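
    /*
     * Summary sketch of the comparison operators handled by match(), as they
     * would appear in a selector (the attribute names and values here are
     * made-up examples):
     *
     *     a[rel=nofollow]     '='   strict string equality
     *     a[rel!=nofollow]    '!='  strict string inequality
     *     a[href^=http]       '^='  value starts with the pattern (pattern is quoted)
     *     a[href$=.pdf]       '$='  value ends with the pattern (pattern is quoted)
     *     a[href*=example]    '*='  pattern is used as a case-insensitive regex body
     *
     * Note that for '*=' the pattern is NOT passed through preg_quote(), so
     * regex metacharacters in it are live; a pattern that itself starts with
     * '/' is handed to preg_match() unchanged as a complete regex.
     */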



    protected function parse_selector($selector_string) {

        global $debugObject;

        if (is_object($debugObject)) {$debugObject->debugLogEntry(1);}



        // pattern of CSS selectors, modified from mootools

        // Paperg: Add the colon to the attribute, so that it properly finds <tag attr:ibute="something" > like google does.

        // Note: if you try to look at this attribute, you MUST use getAttribute since $dom->x:y will fail the php syntax check.

        // Notice the \[ starting the attribute and the @? following it? This implies that an html attribute
        // specifier may start with an @ sign that is NOT captured by the expression.

        // Further study is required to determine if this should be documented or removed.

//        $pattern = "/([\w-:\*]*)(?:\#([\w-]+)|\.([\w-]+))?(?:\[@?(!?[\w-]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ]+)/is";

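        // Hyphens inside the character classes are escaped: PCRE2 (PHP 7.3+) rejects [\w-:] as an invalid range.

        $pattern = "/([\w\-:\*]*)(?:\#([\w\-]+)|\.([\w\-]+))?(?:\[@?(!?[\w\-:]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ]+)/is";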

        preg_match_all($pattern, trim($selector_string).' ', $matches, PREG_SET_ORDER);

        if (is_object($debugObject)) {$debugObject->debugLog(2, "Matches Array: ", $matches);}



        $selectors = array();

        $result = array();

        //print_r($matches);



        foreach ($matches as $m) {

            $m[0] = trim($m[0]);

            if ($m[0]==='' || $m[0]==='/' || $m[0]==='//') continue;

            // for browser generated xpath

            if ($m[1]==='tbody') continue;



            list($tag, $key, $val, $exp, $no_key) = array($m[1], null, null, '=', false);

            if (!empty($m[2])) {$key='id'; $val=$m[2];}

            if (!empty($m[3])) {$key='class'; $val=$m[3];}

            if (!empty($m[4])) {$key=$m[4];}

            if (!empty($m[5])) {$exp=$m[5];}

            if (!empty($m[6])) {$val=$m[6];}



            // convert to lowercase

            if ($this->dom->lowercase) {$tag=strtolower($tag); $key=strtolower($key);}

            //elements that do NOT have the specified attribute

            if (isset($key[0]) && $key[0]==='!') {$key=substr($key, 1); $no_key=true;}



            $result[] = array($tag, $key, $val, $exp, $no_key);

            if (trim($m[7])===',') {

                $selectors[] = $result;

                $result = array();

            }

        }

        if (count($result)>0)

            $selectors[] = $result;

        return $selectors;

    }



    function __get($name) {

        if (isset($this->attr[$name]))

        {

            return $this->convert_text($this->attr[$name]);

        }

        switch ($name) {

            case 'outertext': return $this->outertext();

            case 'innertext': return $this->innertext();

            case 'plaintext': return $this->text();

            case 'xmltext': return $this->xmltext();

            default: return array_key_exists($name, $this->attr);

        }

    }



    function __set($name, $value) {

        switch ($name) {

            case 'outertext': return $this->_[HDOM_INFO_OUTER] = $value;

            case 'innertext':

                if (isset($this->_[HDOM_INFO_TEXT])) return $this->_[HDOM_INFO_TEXT] = $value;

                return $this->_[HDOM_INFO_INNER] = $value;

        }

        if (!isset($this->attr[$name])) {

            $this->_[HDOM_INFO_SPACE][] = array(' ', '', '');

            $this->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE;

        }

        $this->attr[$name] = $value;

    }
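
    /*
     * Usage sketch for the magic accessors above (illustrative only; $a is
     * assumed to be a node returned by find('a', 0)):
     *
     *     $href  = $a->href;        // attribute value, charset-converted
     *     $inner = $a->innertext;   // same as $a->innertext()
     *     $outer = $a->outertext;   // same as $a->outertext()
     *     $plain = $a->plaintext;   // same as $a->text()
     *     $none  = $a->nosuchattr;  // false when the attribute does not exist
     *
     *     $a->href = '/local';      // sets (or adds) an attribute
     *     $a->innertext = 'label';  // overrides the rendered inner content
     *     $a->outertext = '';       // node renders as an empty string
     */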



    function __isset($name) {

        switch ($name) {

            case 'outertext': return true;

            case 'innertext': return true;

            case 'plaintext': return true;

        }

        //no value attr: nowrap, checked selected...

        return (array_key_exists($name, $this->attr)) ? true : isset($this->attr[$name]);

    }



    function __unset($name) {

        if (isset($this->attr[$name]))

            unset($this->attr[$name]);

    }



    // PaperG - Function to convert the text from one character set to another if the two sets are not the same.

    function convert_text($text)

    {

        global $debugObject;

        if (is_object($debugObject)) {$debugObject->debugLogEntry(1);}



        $converted_text = $text;



        $sourceCharset = "";

        $targetCharset = "";



        if ($this->dom)

        {

            $sourceCharset = strtoupper($this->dom->_charset);

            $targetCharset = strtoupper($this->dom->_target_charset);

        }

        if (is_object($debugObject)) {$debugObject->debugLog(3, "source charset: " . $sourceCharset . " target charset: " . $targetCharset);}



        if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))

        {

            // Check if the reported encoding could have been incorrect and the text is actually already UTF-8

            if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))

            {

                $converted_text = $text;

            }

            else

            {

                $converted_text = iconv($sourceCharset, $targetCharset, $text);

            }

        }



        // Let's make sure that we don't have that silly BOM issue with any of the utf-8 text we output.

        if ($targetCharset == 'UTF-8')

        {

            if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")

            {

                $converted_text = substr($converted_text, 3);

            }

            if (substr($converted_text, -3) == "\xef\xbb\xbf")

            {

                $converted_text = substr($converted_text, 0, -3);

            }

        }



        return $converted_text;

    }



    /**

    * Returns true if $str is valid UTF-8 and false otherwise.

    *

    * @param mixed $str String to be tested

    * @return boolean

    */

    static function is_utf8($str)

    {

        $c=0; $b=0;

        $bits=0;

        $len=strlen($str);

        for($i=0; $i<$len; $i++)

        {

            $c=ord($str[$i]);

            // 0x80 (128) is a continuation byte, so it is not valid as a lead byte either.
            if($c >= 128)

            {

                if(($c >= 254)) return false;

                elseif($c >= 252) $bits=6;

                elseif($c >= 248) $bits=5;

                elseif($c >= 240) $bits=4;

                elseif($c >= 224) $bits=3;

                elseif($c >= 192) $bits=2;

                else return false;

                if(($i+$bits) > $len) return false;

                while($bits > 1)

                {

                    $i++;

                    $b=ord($str[$i]);

                    if($b < 128 || $b > 191) return false;

                    $bits--;

                }

            }

        }

        return true;

    }
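
    /*
     * Usage sketch for is_utf8() (illustrative only; the byte strings are
     * hand-picked examples):
     *
     *     self::is_utf8("abc");        // plain ASCII is accepted
     *     self::is_utf8("\xC3\xA9");   // two-byte sequence (lead 0xC3, continuation 0xA9) is accepted
     *     self::is_utf8("\xE9");       // lone 0xE9 announces 3 bytes but the string ends: rejected
     *
     * The check is structural only (lead byte followed by the right number of
     * continuation bytes); it does not reject overlong encodings.
     */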

    /*

    function is_utf8($string)

    {

        //this is buggy

        return (utf8_encode(utf8_decode($string)) == $string);

    }

    */



    /**

     * Function to try a few tricks to determine the displayed size of an img on the page.

     * NOTE: This will ONLY work on an IMG tag. Returns FALSE on all other tag types.

     *

     * @author John Schlick

     * @version April 19 2012

     * @return array|false an array containing the 'height' and 'width' of the image on the page (-1 for a value we can't figure out), or false if this node is not an img tag.

     */

    function get_display_size()

    {

        global $debugObject;



        $width = -1;

        $height = -1;



        if ($this->tag !== 'img')

        {

            return false;

        }



        // See if there is a height or width attribute in the tag itself.

        if (isset($this->attr['width']))

        {

            $width = $this->attr['width'];

        }



        if (isset($this->attr['height']))

        {

            $height = $this->attr['height'];

        }



        // Now look for an inline style.

        if (isset($this->attr['style']))

        {

            // Thanks to user gnarf from stackoverflow for this regular expression.

            $attributes = array();

            preg_match_all("/([\w-]+)\s*:\s*([^;]+)\s*;?/", $this->attr['style'], $matches, PREG_SET_ORDER);

            foreach ($matches as $match) {

              $attributes[$match[1]] = $match[2];

            }



            // If there is a width in the style attributes:

            if (isset($attributes['width']) && $width == -1)

            {

                // check that the last two characters are px (pixels)

                if (strtolower(substr($attributes['width'], -2)) == 'px')

                {

                    $proposed_width = substr($attributes['width'], 0, -2);

                    // Now make sure that it's an integer and not something stupid.

                    if (filter_var($proposed_width, FILTER_VALIDATE_INT))

                    {

                        $width = $proposed_width;

                    }

                }

            }



            // If there is a height in the style attributes:

            if (isset($attributes['height']) && $height == -1)

            {

                // check that the last two characters are px (pixels)

                if (strtolower(substr($attributes['height'], -2)) == 'px')

                {

                    $proposed_height = substr($attributes['height'], 0, -2);

                    // Now make sure that it's an integer and not something stupid.

                    if (filter_var($proposed_height, FILTER_VALIDATE_INT))

                    {

                        $height = $proposed_height;

                    }

                }

            }



        }



        // Future enhancement:

        // Look in the tag to see if there is a class or id specified that has a height or width attribute to it.



        // Far future enhancement

        // Look at all the parent tags of this image to see if they specify a class or id that has an img selector that specifies a height or width

        // Note that in this case, the class or id will have the img subselector for it to apply to the image.



        // ridiculously far future development

        // If the class or id is specified in a SEPARATE css file that's not on the page, go get it and do what we were just doing for the ones on the page.



        $result = array('height' => $height,

                        'width' => $width);

        return $result;

    }
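
    /*
     * Usage sketch for get_display_size() (illustrative only; the markup is a
     * made-up example and $dom is assumed to be a simple_html_dom instance):
     *
     *     $dom->load('<img src="a.png" width="120" style="height: 40px">');
     *     $img  = $dom->find('img', 0);
     *     $size = $img->get_display_size();
     *     if ($size !== false && $size['width'] != -1) {
     *         // width/height attributes win; inline style "NNpx" values are
     *         // only used for a dimension the attributes did not provide.
     *         echo $size['width'] . 'x' . $size['height'] . "\n";
     *     }
     *
     * Values are returned as found in the markup (strings), or -1 when unknown.
     */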



    // camel naming conventions

    function getAllAttributes() {return $this->attr;}

    function getAttribute($name) {return $this->__get($name);}

    function setAttribute($name, $value) {$this->__set($name, $value);}

    function hasAttribute($name) {return $this->__isset($name);}

    function removeAttribute($name) {$this->__set($name, null);}

    function getElementById($id) {return $this->find("#$id", 0);}

    function getElementsById($id, $idx=null) {return $this->find("#$id", $idx);}

    function getElementByTagName($name) {return $this->find($name, 0);}

    function getElementsByTagName($name, $idx=null) {return $this->find($name, $idx);}

    function parentNode() {return $this->parent();}

    function childNodes($idx=-1) {return $this->children($idx);}

    function firstChild() {return $this->first_child();}

    function lastChild() {return $this->last_child();}

    function nextSibling() {return $this->next_sibling();}

    function previousSibling() {return $this->prev_sibling();}

    function hasChildNodes() {return $this->has_child();}

    function nodeName() {return $this->tag;}

    function appendChild($node) {$node->parent($this); return $node;}



}



/**

 * simple html dom parser

 * Paperg - in the find routine: allow us to specify that we want case insensitive testing of the value of the selector.

 * Paperg - change $size from protected to public so we can easily access it

 * Paperg - added ForceTagsClosed in the constructor which tells us whether we trust the html or not.  Default is to NOT trust it.

 *

 * @package PlaceLocalInclude

 */

class simple_html_dom

{

    public $root = null;

    public $nodes = array();

    public $callback = null;

    public $lowercase = false;

    // Used to keep track of how large the text was when we started.

    public $original_size;

    public $size;

    protected $pos;

    protected $doc;

    protected $char;

    protected $cursor;

    protected $parent;

    protected $noise = array();

    protected $token_blank = " \t\r\n";

    protected $token_equal = ' =/>';

    protected $token_slash = " />\r\n\t";

    protected $token_attr = ' >';

    // Note that this is referenced by a child node, and so it needs to be public for that node to see this information.

    public $_charset = '';

    public $_target_charset = '';

    protected $default_br_text = "";

    public $default_span_text = "";



    // use isset instead of in_array, performance boost about 30%...

    protected $self_closing_tags = array('img'=>1, 'br'=>1, 'input'=>1, 'meta'=>1, 'link'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);

    protected $block_tags = array('root'=>1, 'body'=>1, 'form'=>1, 'div'=>1, 'span'=>1, 'table'=>1);

    // Known sourceforge issue #2977341

    // B tags that are not closed cause us to return everything to the end of the document.

    protected $optional_closing_tags = array(

        'tr'=>array('tr'=>1, 'td'=>1, 'th'=>1),

        'th'=>array('th'=>1),

        'td'=>array('td'=>1),

        'li'=>array('li'=>1),

        'dt'=>array('dt'=>1, 'dd'=>1),

        'dd'=>array('dd'=>1, 'dt'=>1),

        'dl'=>array('dd'=>1, 'dt'=>1),

        'p'=>array('p'=>1),

        'nobr'=>array('nobr'=>1),

        'b'=>array('b'=>1),

        'option'=>array('option'=>1),

    );



    function __construct($str=null, $lowercase=true, $forceTagsClosed=true, $target_charset=DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)

    {

        if ($str)

        {

            if (preg_match("/^https?:\/\//i",$str) || is_file($str))

            {

                $this->load_file($str);

            }

            else

            {

                $this->load($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);

            }

        }

        // Forcing tags to be closed implies that we don't trust the html, but it can lead to parsing errors if we SHOULD trust the html.

        if (!$forceTagsClosed) {

            $this->optional_closing_tags=array();

        }

        $this->_target_charset = $target_charset;

    }



    function __destruct()

    {

        $this->clear();

    }



    // load html from string

    function load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)

    {

        global $debugObject;



        // prepare

        $this->prepare($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);

        // strip out comments

        $this->remove_noise("'<!--(.*?)-->'is");

        // strip out cdata

        $this->remove_noise("'<!\[CDATA\[(.*?)\]\]>'is", true);

        // Per sourceforge http://sourceforge.net/tracker/?func=detail&aid=2949097&group_id=218559&atid=1044037

        // Script tags removal now precedes style tag removal.

        // strip out <script> tags

        $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");

        $this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");

        // strip out <style> tags

        $this->remove_noise("'<\s*style[^>]*[^/]>(.*?)<\s*/\s*style\s*>'is");

        $this->remove_noise("'<\s*style\s*>(.*?)<\s*/\s*style\s*>'is");

        // strip out preformatted tags

        $this->remove_noise("'<\s*(?:code)[^>]*>(.*?)<\s*/\s*(?:code)\s*>'is");

        // strip out server side scripts

        $this->remove_noise("'(<\?)(.*?)(\?>)'s", true);

        // strip smarty scripts

        $this->remove_noise("'(\{\w)(.*?)(\})'s", true);



        // parsing

        while ($this->parse());

        // end

        $this->root->_[HDOM_INFO_END] = $this->cursor;

        $this->parse_charset();



        // make load function chainable

        return $this;



    }
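
    /*
     * Usage sketch for load() (illustrative only; the markup is a made-up
     * example). Because load() returns $this, calls can be chained:
     *
     *     $dom  = new simple_html_dom();
     *     $text = $dom->load('<p>Hello <b>world</b></p>')->find('p', 0)->plaintext;
     *
     *     // In long-running processes, release the node graph explicitly
     *     // once the document is no longer needed.
     *     $dom->clear();
     *     unset($dom);
     */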



    // load html from file

    function load_file()

    {

        $args = func_get_args();

        $this->load(call_user_func_array('file_get_contents', $args), true);

        // Throw an error if we can't properly load the dom.

        if (($error=error_get_last())!==null) {

            $this->clear();

            return false;

        }

    }



    // set callback function

    function set_callback($function_name)

    {

        $this->callback = $function_name;

    }



    // remove callback function

    function remove_callback()

    {

        $this->callback = null;

    }



    // save dom as string

    function save($filepath='')

    {

        $ret = $this->root->innertext();

        if ($filepath!=='') file_put_contents($filepath, $ret, LOCK_EX);

        return $ret;

    }



    // find dom node by css selector

    // Paperg - allow us to specify that we want case insensitive testing of the value of the selector.

    function find($selector, $idx=null, $lowercase=false)

    {

        return $this->root->find($selector, $idx, $lowercase);

    }



    // clean up memory due to php5 circular references memory leak...

    function clear()

    {

        foreach ($this->nodes as $n) {$n->clear(); $n = null;}

        // The next line was added per sourceforge tracker item 2977248 as a fix for memory leaks that occur even when clear() is used.

        if (isset($this->children)) foreach ($this->children as $n) {$n->clear(); $n = null;}

        if (isset($this->parent)) {$this->parent->clear(); unset($this->parent);}

        if (isset($this->root)) {$this->root->clear(); unset($this->root);}

        unset($this->doc);

        unset($this->noise);

    }



    function dump($show_attr=true)

    {

        $this->root->dump($show_attr);

    }



    // prepare HTML data and init everything

    protected function prepare($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)

    {

        $this->clear();



        // set the length of content before we do anything to it.

        $this->size = strlen($str);

        // Save the original size of the html that we got in.  It might be useful to someone.

        $this->original_size = $this->size;



        //before we save the string as the doc...  strip out the \r \n's if we are told to.

        if ($stripRN) {

            $str = str_replace("\r", " ", $str);

            $str = str_replace("\n", " ", $str);



            // set the length of content since we have changed it.

            $this->size = strlen($str);

        }



        $this->doc = $str;

        $this->pos = 0;

        $this->cursor = 1;

        $this->noise = array();

        $this->nodes = array();

        $this->lowercase = $lowercase;

        $this->default_br_text = $defaultBRText;

        $this->default_span_text = $defaultSpanText;

        $this->root = new simple_html_dom_node($this);

        $this->root->tag = 'root';

        $this->root->_[HDOM_INFO_BEGIN] = -1;

        $this->root->nodetype = HDOM_TYPE_ROOT;

        $this->parent = $this->root;

        if ($this->size>0) $this->char = $this->doc[0];

    }



    // parse html content

    protected function parse()

    {

        if (($s = $this->copy_until_char('<'))==='')

        {

            return $this->read_tag();

        }



        // text

        $node = new simple_html_dom_node($this);

        ++$this->cursor;

        $node->_[HDOM_INFO_TEXT] = $s;

        $this->link_nodes($node, false);

        return true;

    }



    // PAPERG - dkchou - added this to try to identify the character set of the page we have just parsed so we know better how to spit it out later.

    // NOTE:  IF you provide a routine called get_last_retrieve_url_contents_content_type which returns the CURLINFO_CONTENT_TYPE from the last curl_exec

    // (or the content_type header from the last transfer), we will parse THAT, and if a charset is specified, we will use it over any other mechanism.

    protected function parse_charset()

    {

        global $debugObject;



        $charset = null;



        if (function_exists('get_last_retrieve_url_contents_content_type'))

        {

            $contentTypeHeader = get_last_retrieve_url_contents_content_type();

            $success = preg_match('/charset=(.+)/', $contentTypeHeader, $matches);

            if ($success)

            {

                $charset = $matches[1];

                if (is_object($debugObject)) {$debugObject->debugLog(2, 'header content-type found charset of: ' . $charset);}

            }



        }



        if (empty($charset))

        {

            $el = $this->root->find('meta[http-equiv=Content-Type]',0);

            if (!empty($el))

            {

                $fullvalue = $el->content;

                if (is_object($debugObject)) {$debugObject->debugLog(2, 'meta content-type tag found: ' . $fullvalue);}



                if (!empty($fullvalue))

                {

                    $success = preg_match('/charset=(.+)/', $fullvalue, $matches);

                    if ($success)

                    {

                        $charset = $matches[1];

                    }

                    else

                    {

                        // If there is a meta tag, and they don't specify the character set, research says that it's typically ISO-8859-1

                        if (is_object($debugObject)) {$debugObject->debugLog(2, 'meta content-type tag couldn\'t be parsed. using iso-8859 default.');}

                        $charset = 'ISO-8859-1';

                    }

                }

            }

        }



        // If we couldn't find a charset above, then lets try to detect one based on the text we got...

        if (empty($charset))

        {

            // Have php try to detect the encoding from the text given to us.

            $charset = mb_detect_encoding($this->root->plaintext . "ascii", array("UTF-8", "CP1252"));

            if (is_object($debugObject)) {$debugObject->debugLog(2, 'mb_detect found: ' . $charset);}



            // and if this doesn't work...  then we need to just wrongheadedly assume it's UTF-8 so that we can move on - cause this will usually give us most of what we need...

            if ($charset === false)

            {

                if (is_object($debugObject)) {$debugObject->debugLog(2, 'since mb_detect failed - using default of utf-8');}

                $charset = 'UTF-8';

            }

        }



        // Since CP1252 is a superset, if we get one of its subsets, we want it instead.

        if ((strtolower($charset) == strtolower('ISO-8859-1')) || (strtolower($charset) == strtolower('Latin1')) || (strtolower($charset) == strtolower('Latin-1')))

        {

            if (is_object($debugObject)) {$debugObject->debugLog(2, 'replacing ' . $charset . ' with CP1252 as it\'s a superset');}

            $charset = 'CP1252';

        }



        if (is_object($debugObject)) {$debugObject->debugLog(1, 'EXIT - ' . $charset);}



        return $this->_charset = $charset;

    }



    // read tag info

    protected function read_tag()

    {

        if ($this->char!=='<')

        {

            $this->root->_[HDOM_INFO_END] = $this->cursor;

            return false;

        }

        $begin_tag_pos = $this->pos;

        $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next



        // end tag

        if ($this->char==='/')

        {

            $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

            // This represents the change in the simple_html_dom trunk from revision 180 to 181.

            // $this->skip($this->token_blank_t);

            $this->skip($this->token_blank);

            $tag = $this->copy_until_char('>');



            // skip attributes in end tag

            if (($pos = strpos($tag, ' '))!==false)

                $tag = substr($tag, 0, $pos);



            $parent_lower = strtolower($this->parent->tag);

            $tag_lower = strtolower($tag);



            if ($parent_lower!==$tag_lower)

            {

                if (isset($this->optional_closing_tags[$parent_lower]) && isset($this->block_tags[$tag_lower]))

                {

                    $this->parent->_[HDOM_INFO_END] = 0;

                    $org_parent = $this->parent;



                    while (($this->parent->parent) && strtolower($this->parent->tag)!==$tag_lower)

                        $this->parent = $this->parent->parent;



                    if (strtolower($this->parent->tag)!==$tag_lower) {

                        $this->parent = $org_parent; // restore original parent

                        if ($this->parent->parent) $this->parent = $this->parent->parent;

                        $this->parent->_[HDOM_INFO_END] = $this->cursor;

                        return $this->as_text_node($tag);

                    }

                }

                else if (($this->parent->parent) && isset($this->block_tags[$tag_lower]))

                {

                    $this->parent->_[HDOM_INFO_END] = 0;

                    $org_parent = $this->parent;



                    while (($this->parent->parent) && strtolower($this->parent->tag)!==$tag_lower)

                        $this->parent = $this->parent->parent;



                    if (strtolower($this->parent->tag)!==$tag_lower)

                    {

                        $this->parent = $org_parent; // restore original parent

                        $this->parent->_[HDOM_INFO_END] = $this->cursor;

                        return $this->as_text_node($tag);

                    }

                }

                else if (($this->parent->parent) && strtolower($this->parent->parent->tag)===$tag_lower)

                {

                    $this->parent->_[HDOM_INFO_END] = 0;

                    $this->parent = $this->parent->parent;

                }

                else

                    return $this->as_text_node($tag);

            }



            $this->parent->_[HDOM_INFO_END] = $this->cursor;

            if ($this->parent->parent) $this->parent = $this->parent->parent;



            $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

            return true;

        }



        $node = new simple_html_dom_node($this);

        $node->_[HDOM_INFO_BEGIN] = $this->cursor;

        ++$this->cursor;

        $tag = $this->copy_until($this->token_slash);

        $node->tag_start = $begin_tag_pos;



        // doctype, cdata & comments...

        if (isset($tag[0]) && $tag[0]==='!') {

            $node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until_char('>');



            if (isset($tag[2]) && $tag[1]==='-' && $tag[2]==='-') {

                $node->nodetype = HDOM_TYPE_COMMENT;

                $node->tag = 'comment';

            } else {

                $node->nodetype = HDOM_TYPE_UNKNOWN;

                $node->tag = 'unknown';

            }

            if ($this->char==='>') $node->_[HDOM_INFO_TEXT].='>';

            $this->link_nodes($node, true);

            $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

            return true;

        }



        // text

        if (strpos($tag, '<')!==false) {

            $tag = '<' . substr($tag, 0, -1);

            $node->_[HDOM_INFO_TEXT] = $tag;

            $this->link_nodes($node, false);

            $this->char = $this->doc[--$this->pos]; // prev

            return true;

        }



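        // The hyphen is escaped: an unescaped "\w-:" is rejected as an invalid range by PCRE2 (PHP 7.3 and later).

        if (!preg_match("/^[\w\-:]+$/", $tag)) {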

            $node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until('<>');

            if ($this->char==='<') {

                $this->link_nodes($node, false);

                return true;

            }



            if ($this->char==='>') $node->_[HDOM_INFO_TEXT].='>';

            $this->link_nodes($node, false);

            $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

            return true;

        }



        // begin tag

        $node->nodetype = HDOM_TYPE_ELEMENT;

        $tag_lower = strtolower($tag);

        $node->tag = ($this->lowercase) ? $tag_lower : $tag;



        // handle optional closing tags

        if (isset($this->optional_closing_tags[$tag_lower]) )

        {

            while (isset($this->optional_closing_tags[$tag_lower][strtolower($this->parent->tag)]))

            {

                $this->parent->_[HDOM_INFO_END] = 0;

                $this->parent = $this->parent->parent;

            }

            $node->parent = $this->parent;

        }



        $guard = 0; // prevent infinite loop

        $space = array($this->copy_skip($this->token_blank), '', '');



        // attributes

        do

        {

            if ($this->char!==null && $space[0]==='')

            {

                break;

            }

            $name = $this->copy_until($this->token_equal);

            if ($guard===$this->pos)

            {

                $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                continue;

            }

            $guard = $this->pos;



            // handle endless '<'

            if ($this->pos>=$this->size-1 && $this->char!=='>') {

                $node->nodetype = HDOM_TYPE_TEXT;

                $node->_[HDOM_INFO_END] = 0;

                $node->_[HDOM_INFO_TEXT] = '<'.$tag . $space[0] . $name;

                $node->tag = 'text';

                $this->link_nodes($node, false);

                return true;

            }



            // handle mismatch '<'

            if ($this->doc[$this->pos-1]=='<') {

                $node->nodetype = HDOM_TYPE_TEXT;

                $node->tag = 'text';

                $node->attr = array();

                $node->_[HDOM_INFO_END] = 0;

                $node->_[HDOM_INFO_TEXT] = substr($this->doc, $begin_tag_pos, $this->pos-$begin_tag_pos-1);

                $this->pos -= 2;

                $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                $this->link_nodes($node, false);

                return true;

            }



            if ($name!=='/' && $name!=='') {

                $space[1] = $this->copy_skip($this->token_blank);

                $name = $this->restore_noise($name);

                if ($this->lowercase) $name = strtolower($name);

                if ($this->char==='=') {

                    $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                    $this->parse_attr($node, $name, $space);

                }

                else {

                    //no value attr: nowrap, checked selected...

                    $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;

                    $node->attr[$name] = true;

                    if ($this->char!='>') $this->char = $this->doc[--$this->pos]; // prev

                }

                $node->_[HDOM_INFO_SPACE][] = $space;

                $space = array($this->copy_skip($this->token_blank), '', '');

            }

            else

                break;

        } while ($this->char!=='>' && $this->char!=='/');



        $this->link_nodes($node, true);

        $node->_[HDOM_INFO_ENDSPACE] = $space[0];



        // check self closing

        if ($this->copy_until_char_escape('>')==='/')

        {

            $node->_[HDOM_INFO_ENDSPACE] .= '/';

            $node->_[HDOM_INFO_END] = 0;

        }

        else

        {

            // reset parent

            if (!isset($this->self_closing_tags[strtolower($node->tag)])) $this->parent = $node;

        }

        $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next



        // If it's a BR tag, we need to set its text to the default text.

        // This way when we see it in plaintext, we can generate formatting that the user wants.

        // since a br tag never has sub nodes, this works well.

        if ($node->tag == "br")

        {

            $node->_[HDOM_INFO_INNER] = $this->default_br_text;

        }



        return true;

    }



    // parse attributes

    protected function parse_attr($node, $name, &$space)

    {

        // Per sourceforge: http://sourceforge.net/tracker/?func=detail&aid=3061408&group_id=218559&atid=1044037

        // If the attribute is already defined inside a tag, only pay attention to the first one as opposed to the last one.

        if (isset($node->attr[$name]))

        {

            return;

        }



        $space[2] = $this->copy_skip($this->token_blank);

        switch ($this->char) {

            case '"':

                $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE;

                $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                $node->attr[$name] = $this->restore_noise($this->copy_until_char_escape('"'));

                $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                break;

            case '\'':

                $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_SINGLE;

                $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                $node->attr[$name] = $this->restore_noise($this->copy_until_char_escape('\''));

                $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

                break;

            default:

                $node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;

                $node->attr[$name] = $this->restore_noise($this->copy_until($this->token_attr));

        }

        // PaperG: Attributes should not have \r or \n in them, that counts as html whitespace.

        $node->attr[$name] = str_replace("\r", "", $node->attr[$name]);

        $node->attr[$name] = str_replace("\n", "", $node->attr[$name]);

        // PaperG: If this is a "class" attribute, let's get rid of the leading and trailing space since some people leave it in the multi class case.

        if ($name == "class") {

            $node->attr[$name] = trim($node->attr[$name]);

        }

    }



    // link node's parent

    protected function link_nodes(&$node, $is_child)

    {

        $node->parent = $this->parent;

        $this->parent->nodes[] = $node;

        if ($is_child)

        {

            $this->parent->children[] = $node;

        }

    }



    // as a text node

    protected function as_text_node($tag)

    {

        $node = new simple_html_dom_node($this);

        ++$this->cursor;

        $node->_[HDOM_INFO_TEXT] = '</' . $tag . '>';

        $this->link_nodes($node, false);

        $this->char = (++$this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

        return true;

    }



    protected function skip($chars)

    {

        $this->pos += strspn($this->doc, $chars, $this->pos);

        $this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

    }



    protected function copy_skip($chars)

    {

        $pos = $this->pos;

        $len = strspn($this->doc, $chars, $pos);

        $this->pos += $len;

        $this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

        if ($len===0) return '';

        return substr($this->doc, $pos, $len);

    }



    protected function copy_until($chars)

    {

        $pos = $this->pos;

        $len = strcspn($this->doc, $chars, $pos);

        $this->pos += $len;

        $this->char = ($this->pos<$this->size) ? $this->doc[$this->pos] : null; // next

        return substr($this->doc, $pos, $len);

    }



    protected function copy_until_char($char)

    {

        if ($this->char===null) return '';



        if (($pos = strpos($this->doc, $char, $this->pos))===false) {

            $ret = substr($this->doc, $this->pos, $this->size-$this->pos);

            $this->char = null;

            $this->pos = $this->size;

            return $ret;

        }



        if ($pos===$this->pos) return '';

        $pos_old = $this->pos;

        $this->char = $this->doc[$pos];

        $this->pos = $pos;

        return substr($this->doc, $pos_old, $pos-$pos_old);

    }



    protected function copy_until_char_escape($char)

    {

        if ($this->char===null) return '';



        $start = $this->pos;

        while (1)

        {

            if (($pos = strpos($this->doc, $char, $start))===false)

            {

                $ret = substr($this->doc, $this->pos, $this->size-$this->pos);

                $this->char = null;

                $this->pos = $this->size;

                return $ret;

            }



            if ($pos===$this->pos) return '';



            if ($this->doc[$pos-1]==='\\') {

                $start = $pos+1;

                continue;

            }



            $pos_old = $this->pos;

            $this->char = $this->doc[$pos];

            $this->pos = $pos;

            return substr($this->doc, $pos_old, $pos-$pos_old);

        }

    }



    // remove noise from html content

    // save the noise in the $this->noise array.

    protected function remove_noise($pattern, $remove_tag=false)

    {

        global $debugObject;

        if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }



        $count = preg_match_all($pattern, $this->doc, $matches, PREG_SET_ORDER|PREG_OFFSET_CAPTURE);



        for ($i=$count-1; $i>-1; --$i)

        {

            $key = '___noise___'.sprintf('% 5d', count($this->noise)+1000);

            if (is_object($debugObject)) { $debugObject->debugLog(2, 'key is: ' . $key); }

            $idx = ($remove_tag) ? 0 : 1;

            $this->noise[$key] = $matches[$i][$idx][0];

            $this->doc = substr_replace($this->doc, $key, $matches[$i][$idx][1], strlen($matches[$i][$idx][0]));

        }



        // reset the length of content

        $this->size = strlen($this->doc);

        if ($this->size>0)

        {

            $this->char = $this->doc[0];

        }

    }



    // restore noise to html content

    function restore_noise($text)

    {

        global $debugObject;

        if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }



        while (($pos=strpos($text, '___noise___'))!==false)

        {

            // Sometimes there is a broken piece of markup, and we don't GET the pos+11 etc... token which indicates a problem outside of us...

            if (strlen($text) > $pos+15)

            {

                $key = '___noise___'.$text[$pos+11].$text[$pos+12].$text[$pos+13].$text[$pos+14].$text[$pos+15];

                if (is_object($debugObject)) { $debugObject->debugLog(2, 'located key of: ' . $key); }



                if (isset($this->noise[$key]))

                {

                    $text = substr($text, 0, $pos).$this->noise[$key].substr($text, $pos+16);

                }

                else

                {

                    // do this to prevent an infinite loop.

                    $text = substr($text, 0, $pos).'UNDEFINED NOISE FOR KEY: '.$key . substr($text, $pos+16);

                }

            }

            else

            {

                // There is no valid key being given back to us... We must get rid of the ___noise___ or we will have a problem.

                $text = substr($text, 0, $pos).'NO NUMERIC NOISE KEY' . substr($text, $pos+11);

            }

        }

        return $text;

    }



    // Sometimes we NEED one of the noise elements.

    function search_noise($text)

    {

        global $debugObject;

        if (is_object($debugObject)) { $debugObject->debugLogEntry(1); }



        foreach($this->noise as $noiseElement)

        {

            if (strpos($noiseElement, $text)!==false)

            {

                return $noiseElement;

            }

        }

    }

    function __toString()

    {

        return $this->root->innertext();

    }



    function __get($name)

    {

        switch ($name)

        {

            case 'outertext':

                return $this->root->innertext();

            case 'innertext':

                return $this->root->innertext();

            case 'plaintext':

                return $this->root->text();

            case 'charset':

                return $this->_charset;

            case 'target_charset':

                return $this->_target_charset;

        }

    }



    // camel naming conventions

    function childNodes($idx=-1) {return $this->root->childNodes($idx);}

    function firstChild() {return $this->root->first_child();}

    function lastChild() {return $this->root->last_child();}

    function createElement($name, $value=null) {return @str_get_html("<$name>$value</$name>")->first_child();}

    function createTextNode($value) {return @end(str_get_html($value)->nodes);}

    function getElementById($id) {return $this->find("#$id", 0);}

    function getElementsById($id, $idx=null) {return $this->find("#$id", $idx);}

    function getElementByTagName($name) {return $this->find($name, 0);}

    function getElementsByTagName($name, $idx=-1) {return $this->find($name, $idx);}

    function loadFile() {$args = func_get_args(); return call_user_func_array(array($this, 'load_file'), $args);}

}



?>
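
The remove_noise() and restore_noise() methods above swap fragments that would confuse the parser (comments, scripts, styles) for fixed-width placeholder keys and put them back later. The following is only a minimal standalone sketch of that idea, not the class's own code: the helper names stash_noise() and unstash_noise() are hypothetical, the sketch always stashes the whole match (the class can also stash just the first capture group), and it assumes PHP 7.0 or later.

```php
<?php
// Standalone sketch of the placeholder technique; stash_noise() and
// unstash_noise() are hypothetical names, not part of the class above.

function stash_noise(string $doc, string $pattern, array &$noise): string
{
    $count = preg_match_all($pattern, $doc, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    // Walk the matches from last to first so that the byte offsets of
    // earlier matches stay valid while the string is being rewritten.
    for ($i = $count - 1; $i >= 0; --$i) {
        // Fixed-width key: 11-character prefix plus a 5-character number.
        $key = '___noise___' . sprintf('%5d', count($noise) + 1000);
        list($text, $offset) = $matches[$i][0];
        $noise[$key] = $text;
        $doc = substr_replace($doc, $key, $offset, strlen($text));
    }
    return $doc;
}

function unstash_noise(string $text, array $noise): string
{
    // strtr() swaps every stored key back in a single pass.
    return $noise ? strtr($text, $noise) : $text;
}

// Usage: hide an HTML comment, then restore it.
$noise = array();
$html  = '<p>a</p><!-- hidden <b> --><p>b</p>';
$clean = stash_noise($html, "'<!--.*?-->'is", $noise);
$back  = unstash_noise($clean, $noise);
```

Because every key has the same length (16 characters), a consumer can read a key back from a known offset without searching for its end, which is what restore_noise() relies on.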

Copyright © 2018 compunect