znazz75.de IT-Blog
I use “symfony/panther” to achieve the goal to scrape websites by the use of logic patterns.
Login into linux. Install “composer” to install “symfony/panther” and some needed packages:
|
1 |
apt install composer php-xml php-curl php-zip chromium-driver |
Make a directory in your home folder or elsewhere for your script and install “symfony/panther” and “symfony/css-selector” as user. You don’t need to install as development if asked.
|
1 2 |
composer req symfony/panther composer req symfony/css-selector |
Create a php file, I called it scrap.php, inside the directory you just created. You can use my following example code. You can execute the file on the command line with:
|
1 |
php scrap.php |
Example code:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 |
<?php use Symfony\Component\Panther\Client; use Facebook\WebDriver\Exception\TimeoutException; // load panther require __DIR__ . '/vendor/autoload.php'; // start browser and load website $client = Client::createChromeClient(); $client->request( 'GET', 'https://testwebsite.org/test.html' ); // click link $client->clickLink( 'Search' ); // wait for element $client->waitFor( '#tabs' ); // select element 100 from list and click it $input = $crawler->filterXPath( ".//select[@name='length']//option[@value='100']" ); $input->click(); // wait for visibility $client->waitForVisibility( '#processing' ); try { $client->waitForInvisibility( '#processing' ); } catch ( TimeoutException $e ) { // Nothing } for ( $page = 1; $page <= 2; $page ++ ) { // get full html code $html = $client->getCrawler()->html(); $crawler = new Symfony\Component\DomCrawler\Crawler( $html ); // get the whole table $result = $crawler->filter( '#results' )->outerHtml(); // write data to file $myfile = fopen( "table_" . $page . ".txt", "w" ) or die( "Unable to open file!" ); fwrite( $myfile, $result ); fclose( $myfile ); // click next page $client->executeScript( "document.querySelector('#next').click()" ); // wait for element $client->waitForVisibility( '#processing' ); try { $client->waitForInvisibility( '#processing' ); } catch ( TimeoutException $e ) { // Nothing } } |
Here is a example how to scrap a popup window from a table entry with pages:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |
<?php use Symfony\Component\Panther\Client; use Facebook\WebDriver\Exception\TimeoutException; use Facebook\WebDriver\Exception\NoSuchElementException; // load panther require __DIR__ . '/vendor/autoload.php'; // start browser and load website $client = Client::createChromeClient(); $client->request( 'GET', 'https://testwebsite.org/test.html' ); // click link $client->clickLink( 'Search' ); // wait for element $crawler = $client->waitFor( '#tabssearch' ); // select element 100 from list and click it $input = $crawler->filterXPath( ".//select[@name='length']//option[@value='100']" ); $input->click(); // wait for visibility $client->waitForVisibility( '#processing' ); // wait for invisibility try { $client->waitForInvisibility( '#processing' ); } catch ( TimeoutException $e ) { // Nothing } // take a screenshot $client->takeScreenshot( 'screen.png' ); for ( $page = 1; $page < 3; $page ++ ) { for ( $row = 1; $row <= 100; $row ++ ) { if ( $page == 2 && $row > 4 ) { continue; } // click a image inside a table by row and col to open popup $client->executeScript( "document.querySelector('#results tbody tr:nth-of-type(" . $row . ") td:nth-of-type(1) img').click()" ); // if it doesn't succeed repeat step try { $client->waitForVisibility( '#buttons' ); } catch ( NoSuchElementException $e ) { $row -= 1; continue; } // get complete html code $html = $client->getCrawler()->html(); $crawler = new Symfony\Component\DomCrawler\Crawler( $html ); // get html code of the popup $result = $crawler->filter( '.ui-dialog' )->outerHtml(); // Write date to file $myfile = fopen( "item_" . $page . "_" . $row . ".txt", "w" ) or die( "Unable to open file!" ); fwrite( $myfile, $result ); fclose( $myfile ); // Close popup $client->executeScript( "document.querySelector('#buttons #close').click()" ); try { $client->waitForInvisibility( '#buttons' ); } catch ( TimeoutException $e ) { // Nothing } } // click next page $client->executeScript( "document.querySelector('#next').click()" ); $client->waitForVisibility( '#processing' ); try { $client->waitForInvisibility( '#processing' ); } catch ( TimeoutException $e ) { // Nothing } } |
Example to get the desired informations out of the files:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 |
<?php require __DIR__ . '/vendor/autoload.php'; // create a mysql connection $mysqli = mysqli_connect( "localhost", "test", "test", "new_database" ); mysqli_set_charset( $mysqli, "utf8mb4" ); $csv_head = "\"id\",\"name\",\"country\""; $file = fopen( 'names.csv', 'w' ); fwrite( $file, $csv_head . "\n" ); $countries = []; $id = 1; $match = []; for ( $page = 1; $page <= 2; $page ++ ) { for ( $row = 1; $row <= 100; $row ++ ) { if ( $page == 2 & $row > 4 ) { break; } // get file content $html = file_get_contents( "./item_" . $page . "_" . $row . ".txt" ); if(empty($html)) die(); // start crawler $crawler = new Symfony\Component\DomCrawler\Crawler(); $crawler->addHtmlContent( $html ); // get text inside the first element of the table try { $country = $crawler->filterXPath( '//*[@class="header"]/tbody/tr[1]/td[1]' )->text(); // remove unwanted spaces $country = trim(preg_replace( '/\s+/', ' ', $country )); if ( empty( $country ) ) { continue; } // escape data for mysql import $country = mysqli_real_escape_string( $mysqli, $country ); $country = str_replace( "\\'", "''", $country ); $country_key = array_search( $country, $countries ); // verify if country already take and asign unique id number if ( $country_key === false ) { $countries[] = $country; $country_key = array_search( $country, $countries ); } $country_key = ( $country_key + 1 ); } catch ( \InvalidArgumentException $e ) { // data has no country $country = "NULL"; } try { $name = $crawler->filterXPath( '//*[@class="header"]/tbody/tr[2]/td[1]' )->text(); $name = trim(preg_replace( '/\s+/', ' ', $name )); if ( empty( $name ) ) { continue; } $name = "\"" . mysqli_real_escape_string( $mysqli, $name ) . "\""; $name = str_replace( "\\\"", "\"\"", $name ); $name = str_replace( "\\'", "''", $name ); if ( in_array( $name, $match ) ) { continue; } else { $match[] = $name; } } catch ( \InvalidArgumentException $e ) { continue; } // Write data fwrite( $file, $id . "," . $name . "," . $country_key . "\n" ); $id ++; } } fclose( $file ); $csv_head = "\"id\",\"country\""; $file = fopen( 'countries.csv', 'w' ); fwrite( $file, $csv_head . "\n" ); foreach ( $countries as $key => $type ) { fwrite( $file, ( $key + 1 ) . "," . "\"" . $type . "\"" . "\n" ); } fclose( $file ); |
If you got a error like this in the /var/log/apache2/error.log file:
|
1 |
PHP Fatal error: Uncaught RuntimeException: Could not start chrome. Exit code: 1 (General error). Error output: /system.slice/apache2.service is not a snap cgroup |
You can’t start the script with a browser. You have to execute your script on the command line.
If you get this error:
|
1 2 3 |
PHP Fatal error: Uncaught Facebook\WebDriver\Exception\SessionNotCreatedException: session not created: Chrome failed to start: exited normally. (session not created: DevToolsActivePort file doesn't exist) (The process started from chrome location /usr/bin/chromium is no longer running, so ChromeDriver is assuming that Chrome has crashed.) |
Try to execute the code as non root.