Sending Slack messages by web crawling with selenium

1. Install selenium, schedule, and requests

- selenium : launches and controls a web browser. Because it drives the browser directly, it also works on dynamic web pages.
- requests : downloads files and web pages from the internet.
- schedule : lets you configure specific tasks to run on a schedule.
- beautifulsoup : parses HTML, the format web pages are written in. It only works on static web pages.
(The page I am going to crawl is a dynamic web page, so I will use selenium 🤔)

pip3 install selenium 
pip3 install requests 
pip3 install schedule

 

2. Fetch the HTML data of a specific URL (= crawling)

- Download ChromeDriver, set its path, then create a driver in the code and load the web page to crawl.
(It will run in the background, so there is no need to render an actual browser window. That is why options.headless = True is applied.)

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# options.headless = True: run the browser without rendering a window, so all work happens in memory
options = Options()
options.headless = True

driver = webdriver.Chrome(executable_path='./chromedriver', options=options)
driver.get('https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep14')
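
The snippet above uses the Selenium API as it existed when this post was written. Newer Selenium 4 releases dropped the executable_path argument and the options.headless attribute, so on a current install the same setup might look roughly like the sketch below. The function name make_headless_driver and its driver_path parameter are hypothetical names chosen for illustration, not part of the original code.

```python
def make_headless_driver(driver_path="./chromedriver"):
    """Create a headless Chrome driver using the Selenium 4 style API."""
    # Imported inside the function so this file still loads where
    # Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    # Command-line switch that takes the place of `options.headless = True`.
    options.add_argument("--headless")

    # The driver path is wrapped in a Service object instead of being
    # passed as executable_path.
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

With this in place, driver = make_headless_driver() followed by driver.get(...) would stand in for the last two lines of the snippet above.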

https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep14

 


https://chromedriver.chromium.org/downloads

 


 

3. Retrieve a specific item

- In the browser's developer tools, select the item you want, right-click - [copy] - [copy selector], then pass the copied selector to find_element_by_css_selector().
(I retrieved the release date from the site above. It has no RSS feed and I did not want to visit the site every day, so I built this myself. Selenium also provides a variety of other functions for finding an element by different means!)

# find_element_by_css_selector(): returns the first element that matches the given selector
time.sleep(3)  # give the dynamic content time to load
data = driver.find_element_by_css_selector('#DownloadDetail > div.tab-page > div > div.tab-content > div.tab-pane.d-print-block.active > div > div:nth-child(15) > table > tbody > tr > td:nth-child(3)')
release_date = data.text
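
A fixed time.sleep(3) can be too short on a slow connection and wasteful on a fast one. As a minimal sketch, assuming Selenium 4, an explicit wait polls the page until the element is visible and only then reads its text. read_text and its timeout parameter are hypothetical names for illustration; the selector is whatever was copied from the developer tools.

```python
def read_text(driver, css_selector, timeout=10):
    """Wait until the element matching css_selector is visible, then return its text."""
    # Imported inside the function so this file still loads where
    # Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Polls until the element is present and visible; raises
    # TimeoutException if that has not happened after `timeout` seconds.
    element = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, css_selector))
    )
    return element.text
```

Calling release_date = read_text(driver, '<copied selector>') would then replace both the sleep and the lookup.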

 

4. Write a function that sends a Slack message

- Create an App in Slack, get its Webhook URL, and use it in a function that sends a message to Slack.

def send_message(text):
    url = 'https://hooks.slack.com/services/(omitted)'  # rest of the Webhook URL omitted
    payload = { 'text' : text }
    requests.post(url, json=payload)
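
The function above ignores the response, so a wrong Webhook URL or a network problem would go unnoticed. A slightly more defensive variant might look like the sketch below. Passing the URL in as webhook_url and the 10-second timeout are assumptions made for illustration, not something the original code does.

```python
import requests


def send_message(text, webhook_url, timeout=10):
    """Post text to a Slack incoming webhook and return the HTTP status code."""
    # timeout keeps the script from hanging forever on a stalled connection.
    response = requests.post(webhook_url, json={"text": text}, timeout=timeout)
    # Raises requests.HTTPError on a 4xx or 5xx answer instead of failing silently.
    response.raise_for_status()
    return response.status_code
```

Keeping the Webhook URL out of the function body also makes it easier to load it from an environment variable rather than hard-coding it.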

 

5. Write a function that checks the release date

- Compares the release date with today and, if they match, calls the function that sends the Slack message. When job runs, the variable a, which was initialized to 0, is changed to 1. (Once it has run today there is no need for it to run again, so this is used to terminate the code!)

def job():
    global a
    a = 1  # mark that job has run so the loop in step 6 can stop
    if release_date == today:
        send_message("The File-Based Protection pattern has been released.")
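
The comparison in job() only works if today is formatted exactly like the text on the page. The full listing at the end builds that string with an 'X' marker trick that strips leading zeros, which suggests the page shows dates as M/D/YYYY; that format is an inference from the original code, not something verified here. The same string can be built more directly, as in this sketch, where format_like_page and is_released_today are hypothetical helper names.

```python
from datetime import date


def format_like_page(day):
    """Format a date as M/D/YYYY with no leading zeros, e.g. 7/5/2021."""
    return f"{day.month}/{day.day}/{day.year}"


def is_released_today(release_date_text, today=None):
    """Return True if the text scraped from the page equals today's date."""
    if today is None:
        today = date.today()
    # strip() guards against stray whitespace around the scraped cell text.
    return release_date_text.strip() == format_like_page(today)
```

Because the comparison is a plain string match, a zero-padded value such as 07/05/2021 would not count as equal to 7/5/2021.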

 

6. Create the schedule

- Create and run a schedule that fires once an hour. The loop checks a: if job has already run it breaks, otherwise it continues.
(The item being crawled is never released twice in one day, so I compare and schedule it this way. schedule offers many options: every few minutes, every few hours, a specific time on a specific weekday, and so on!)

# schedule.every(n).hours.do(job): sets how many hours apart job runs
schedule.every(1).hours.do(job)

# schedule.run_pending(): runs any jobs that are due
while True:
    schedule.run_pending()
    time.sleep(1)
    if a == 1:
        break
    else:
        continue
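
In the loop above, release_date is read once at startup and a is set to 1 on the first run, so the page is never checked a second time. If the goal is to keep checking every hour until the release shows up, one possible arrangement is sketched below. make_job, run_hourly, and the three callables passed in (fetch_release_date, notify, get_today) are hypothetical names for illustration; fetch_release_date would wrap the Selenium lookup and notify would wrap send_message.

```python
import time


def make_job(fetch_release_date, notify, get_today):
    """Build a job that re-reads the release date on every run."""
    state = {"notified": False}

    def job():
        # Fetch fresh data each time instead of reusing a value read at startup.
        if fetch_release_date() == get_today():
            notify("The File-Based Protection pattern has been released.")
            # Only stop once the message has actually been sent.
            state["notified"] = True

    return job, state


def run_hourly(job, state):
    """Run job once an hour until it reports that the message was sent."""
    # Imported here so make_job can be used without the schedule package.
    import schedule

    schedule.every(1).hours.do(job)
    while not state["notified"]:
        schedule.run_pending()
        time.sleep(1)
```

The flag lives in a small dict rather than a global so that the job and the loop share it without a global statement.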

 

7. Result

- 🎉 Success! 🎉

import time, requests, schedule
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

a = 0  # becomes 1 once job has run
# Today's date as M/D/YYYY without leading zeros: 'X' marks each padded field, then 'X0' drops the zero
today = time.strftime('X%m/X%d/%Y', time.localtime(time.time())).replace('X0', 'X').replace('X', '')

# Headless Chrome: no browser window is rendered
options = Options()
options.headless = True
driver = webdriver.Chrome(executable_path='./chromedriver', options=options)
driver.get('https://www.broadcom.com/support/security-center/definitions/download/detail?gid=sep14')

# Give the dynamic content time to load, then read the release date cell
time.sleep(3)
data = driver.find_element_by_css_selector('#DownloadDetail > div.tab-page > div > div.tab-content > div.tab-pane.d-print-block.active > div > div:nth-child(15) > table > tbody > tr > td:nth-child(3)')
release_date = data.text
driver.quit()  # close the browser once the value has been read

def send_message(text):
    url = 'https://hooks.slack.com/services/(omitted)'  # rest of the Webhook URL omitted
    payload = { 'text' : text }
    requests.post(url, json=payload)

def job():
    global a
    a = 1  # mark that job has run so the loop below can stop
    if release_date == today:
        send_message("The File-Based Protection pattern has been released.")

# Run job once an hour, as described in step 6
schedule.every(1).hours.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
    if a == 1:
        break
    else:
        continue