[Python] Scraping search results with Selenium and exporting them to CSV

みんたく — Tue, 23 Jul 2019 06:59:09 +0000

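This post shows how to scrape search results with Python and Selenium and write them to a CSV file. If you just want to get something running, treat it as a rough reference.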

Preparation

Install Selenium

$ pip install selenium

Download and store the WebDriver

Download the WebDriver for Chrome (ChromeDriver) from the official site below.

sites.google.com

After downloading, extract the ZIP file and place "chromedriver.exe" in a location of your choice.

For this post I created a driver folder directly under the C drive and stored it there.

C:\driver\chromedriver.exe

Scraping Google search results for a search keyword

・searchGoogleResult.py

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import csv
import time

# Open Chrome
chrome = webdriver.Chrome("c:/driver/chromedriver.exe")
chrome.get('https://www.google.co.jp/')

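# Ask for the search keyword and the number of pages to fetch
keyword = input("Keyword: ")
per = input("Number of pages to fetch (enter a number): ")

# Run the search
search_box = chrome.find_element_by_name("q")
# Send the keyword as typed; " ".join() on a string would put a space between every character
search_box.send_keys(keyword)
search_box.send_keys(Keys.RETURN)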

# Collect the search results
result = []
for i in range(int(per)):
    try:
        for target_title_url in chrome.find_elements_by_css_selector(".r > a"):
            result.append(target_title_url.text)

        # Find the link to the next page and move to it
        next_link = chrome.find_element_by_css_selector("#navcnt table td.cur + td a")
        next_link.click()
    except Exception:
        # No next page was found: stop collecting and close the browser once, below
        break

    # Sleep for 5 seconds
    time.sleep(5)

chrome.close()

# Write the search results to CSV
with open('[' + keyword + ']google_search_result.csv', 'w', newline='', encoding='CP932', errors='replace') as f:
    writer = csv.writer(f)
    writer.writerows([result])

Walking through each step of the scraper

Import the required modules.

"import csv" and "import time" are added for the CSV output and for sleeping.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import csv
import time

Load the driver stored earlier and launch Chrome.

chrome = webdriver.Chrome("c:/driver/chromedriver.exe")
chrome.get('https://www.google.co.jp/')

Use input to ask the user for the search keyword and the number of pages to fetch.

keyword = input("Keyword: ")
per = input("Number of pages to fetch (enter a number): ")
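
The page count comes back from input as a string and is later passed to int, so a non-numeric entry would raise a ValueError in the middle of the run. A minimal sketch of checking it up front follows. parse_page_count and its max_pages limit are hypothetical names that do not appear in the original script.

```python
def parse_page_count(text, max_pages=10):
    """Convert the user's input to a page count between 1 and max_pages.

    parse_page_count and max_pages are illustrative, not part of the
    original script.
    """
    try:
        pages = int(text.strip())
    except ValueError:
        # Re-raise with a message that tells the user what was expected
        raise ValueError("page count must be a whole number: %r" % text) from None
    if not 1 <= pages <= max_pages:
        raise ValueError("page count must be between 1 and %d" % max_pages)
    return pages
```

In the script this could be used as per = parse_page_count(input(...)), after which range(per) works without a further int call.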

chrome.find_element_by_name("q") selects the search box, whose name attribute is q, and the search is then run with the keyword that was entered.

search_box = chrome.find_element_by_name("q")
search_box.send_keys(keyword)
search_box.send_keys(Keys.RETURN)

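The results for the search keyword are stored page by page, up to the number of pages entered earlier. Here we collect the title and URL of each search result.

Because the block is wrapped in try/except, collection ends when there is no next page.

Also, sending many requests in a short time can be treated as a DoS attack, so the script calls time.sleep to pause for 5 seconds between pages.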

result = []
for i in range(int(per)):
    try:
        for target_title_url in chrome.find_elements_by_css_selector(".r > a"):
            result.append(target_title_url.text)
        next_link = chrome.find_element_by_css_selector("#navcnt table td.cur + td a")
        next_link.click()
    except Exception:
        # No next page was found: stop collecting and close the browser once, below
        break

    time.sleep(5)

chrome.close()
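
The control flow of this loop (read a page, move to the next one, stop when there is none, wait between pages) can be looked at separately from Selenium. The following is only a sketch under that assumption. collect_pages, read_page and go_next are hypothetical names standing in for the Selenium calls, and the sleep parameter exists only so the wait can be replaced in a test.

```python
import time


def collect_pages(read_page, go_next, pages, delay=5.0, sleep=time.sleep):
    """Collect results from up to `pages` pages.

    read_page() returns the list of results on the current page.
    go_next() moves to the next page and returns False when there is none.
    Both are hypothetical callables, not part of the original script.
    """
    results = []
    for page in range(pages):
        results.extend(read_page())
        if page == pages - 1:
            break  # last requested page: no need to navigate or wait
        if not go_next():
            break  # there is no next page
        sleep(delay)  # pause between page loads to avoid hammering the site
    return results
```

In the script above, read_page could return the text of the elements matched by ".r > a", and go_next could click the next-page link and return False when the lookup fails.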

Once the search results have been collected, write them to CSV.

with open('[' + keyword + ']google_search_result.csv', 'w', newline='', encoding='CP932', errors='replace') as f:
    writer = csv.writer(f)
    writer.writerows([result])
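
Note that writer.writerows([result]) puts every result into a single CSV row. If one result per row is easier to work with, something like the sketch below could be used instead. It also replaces characters that Windows does not allow in file names, since the keyword becomes part of the name. safe_filename and write_results are hypothetical helpers, not part of the original script.

```python
import csv
import os
import re


def safe_filename(keyword):
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", keyword)


def write_results(keyword, results, directory="."):
    """Write one search result per row and return the path that was written."""
    name = "[" + safe_filename(keyword) + "]google_search_result.csv"
    path = os.path.join(directory, name)
    # Same encoding settings as the original script (CP932 for Japanese Excel)
    with open(path, "w", newline="", encoding="CP932", errors="replace") as f:
        writer = csv.writer(f)
        writer.writerows([r] for r in results)  # each result becomes its own row
    return path
```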

To run it, move to the directory that contains the file you created and run the following command (replace searchGoogleResult.py with the name of your own file).

$ python searchGoogleResult.py

Running the scraper in headless mode

To run without showing the browser window, enable headless mode.

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
chrome = webdriver.Chrome("c:/driver/chromedriver.exe", chrome_options=options)
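
The snippets in this post use the Selenium 3 API that was current when it was written. Recent Selenium 4 releases no longer accept the find_element_by_* helpers or the chrome_options argument, and the driver path is passed through a Service object. A rough sketch of the headless setup and the search-box lookup in that style follows, assuming Selenium 4 is installed and the driver sits at the same path as above. The function names build_headless_chrome and search are hypothetical.

```python
def build_headless_chrome(driver_path="c:/driver/chromedriver.exe"):
    """Start headless Chrome using the Selenium 4 style API."""
    # Imported inside the function so this file loads even without Selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(service=Service(executable_path=driver_path),
                            options=options)


def search(chrome, keyword):
    """Type the keyword into the box whose name attribute is q and press Enter."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    search_box = chrome.find_element(By.NAME, "q")
    search_box.send_keys(keyword)
    search_box.send_keys(Keys.RETURN)
```

The CSS selector lookups would change the same way, to chrome.find_elements(By.CSS_SELECTOR, ".r > a"). Whether the selectors themselves still match Google's current markup is a separate question that this sketch does not address.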

Update: I have since refactored the script into classes and functions and added features.

selenium-scrape-sample/seleniumScrapeSample.py at master · simanapo/selenium-scrape-sample · GitHub

