์›น ํฌ๋กค๋ง(BeautifulSoup)

1. Installing BeautifulSoup

- So what is BeautifulSoup? A pretty soup...? It is a Python library for pulling data out of HTML and XML files. The name is said to come from 'Alice in Wonderland', and it carries the rough sense of tidying messy markup into something beautifully arranged?!
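
- As a quick illustration, here is a minimal sketch of what that looks like (the HTML string below is just a made-up example):

from bs4 import BeautifulSoup

sample_html = "<html><body><p>Hello, <b>BeautifulSoup</b>!</p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
print(soup.prettify())    # the same markup, nicely indented
print(soup.p.get_text())  # "Hello, BeautifulSoup!"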

 

Beautiful Soup Documentation (Beautiful Soup 4.9.0) - www.crummy.com

pip3 install bs4
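
- To check that the install worked, a quick sanity check is to import the package and print its version:

import bs4
print(bs4.__version__)  # e.g. 4.9.x — confirms bs4 is importable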

2. Fetching HTML data from a specific URL (= crawling)

- Let's fetch the HTML of the Network category on my blog and list up the post titles!

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
context = ssl._create_unverified_context()  # context that temporarily skips SSL certificate verification
html = urlopen("https://eunhyee.tistory.com/category/Network", context=context)
BS_html = BeautifulSoup(html, "html.parser")
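
- _create_unverified_context() skips certificate verification entirely, which is convenient but not something to leave on everywhere. A minimal sketch of the verified alternative, assuming the system CA store trusts the site, uses the default context instead:

import ssl
from urllib.request import urlopen

context = ssl.create_default_context()  # verifies the server certificate
html = urlopen("https://eunhyee.tistory.com/category/Network", context=context)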

3. Pulling out specific items (select)

- Use Chrome DevTools and the jsoup site (linked below) to work out which data to extract

- The a href inside each div with class 'article-content' is the URL of a post, so extract them with the condition below and build a list. (You could get the information you need from the category page itself without extracting each post URL, but each post page holds more information, so let's go into them.)

url_list = []  # collected post URLs
# the first <a> inside each div.article-content links to a post
for title in BS_html.find_all('div', {'class':'article-content'}):
    url = title.select('a')[0].get('href')
    url_list.append('https://eunhyee.tistory.com' + url)
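
- The same extraction can also be written with CSS selectors only; a sketch equivalent to the loop above (still assuming one post link per div.article-content):

url_list = [
    'https://eunhyee.tistory.com' + div.select_one('a').get('href')
    for div in BS_html.select('div.article-content')
]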

 

 

Try jsoup online: Java HTML parser and CSS debugger - try.jsoup.org

4. Pulling out a specific item (find)

- ์šฐ์„ ์€ ํ•˜๋‚˜๋งŒ ๋ถˆ๋Ÿฌ์™€๋ด…์‹œ๋‹ค h2 tag๋Š” Title URL์˜ ์ œ๋ชฉ ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋‹ˆ h2 tag๋ฅผ ์ฐพ๊ณ  get_text().strip()์„ ํ†ตํ•˜์—ฌ Title ์ •๋ณด๋งŒ ๋ฐ›์•„์˜ค๊ฒ ์Šต๋‹ˆ๋‹ท (get_text().strip()๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์„ ๊ฒฝ์šฐ์—๋Š” 1 <h2 class="title-article">Cisco Systems CCNA(Cisco Certified Network Associate) ์ž๊ฒฉ์ฆ ์ทจ๋“</h2> ์š”๋Ÿฐ ์‹์œผ๋กœ ์ถ”์ถœ๋ฉ๋‹ˆ๋‹ค)

- I only pulled the titles here, but you can add code to extract more data (see the sketch after the code block below)!

for index, title_url in enumerate(url_list):
    html = urlopen(title_url, context=context)
    BS_html = BeautifulSoup(html, "html.parser")
    title = BS_html.find('h2')          # the <h2> tag holds the post title
    title = title.get_text().strip()    # keep only the text, trimmed
    print(index+1, title)
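
- As a sketch of pulling extra data per post, the loop below also grabs the og:description meta tag; whether a given Tistory skin actually emits that tag is an assumption, so adjust the selector to whatever you see in DevTools:

for index, title_url in enumerate(url_list):
    html = urlopen(title_url, context=context)
    BS_html = BeautifulSoup(html, "html.parser")
    title = BS_html.find('h2').get_text().strip()
    desc_tag = BS_html.find('meta', {'property': 'og:description'})  # assumed to exist in the page markup
    desc = desc_tag.get('content') if desc_tag else ''
    print(index + 1, title, '-', desc)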

5. Results

- 🎉 Success! 🎉

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# context that temporarily skips SSL certificate verification
context = ssl._create_unverified_context()
html = urlopen("https://eunhyee.tistory.com/category/Network", context=context)
BS_html = BeautifulSoup(html, "html.parser")

url_list = []

# collect each post's URL from the category page
for title in BS_html.find_all('div', {'class':'article-content'}):
    url = title.select('a')[0].get('href')
    url_list.append('https://eunhyee.tistory.com' + url)

# visit each post and print its number and title
for index, title_url in enumerate(url_list):
    html = urlopen(title_url, context=context)
    BS_html = BeautifulSoup(html, "html.parser")
    title = BS_html.find('h2')
    title = title.get_text().strip()
    print(index+1, title)
1 Cisco Systems CCNA(Cisco Certified Network Associate) ์ž๊ฒฉ์ฆ ์ทจ๋“
2 IPv6
3 Config ์ž๋™ํ™” ํˆด(puppet, chef, salt, ansible)
4 EVE NG ์‹ค์Šต ํ™˜๊ฒฝ ๊ตฌ์ถ•
5 OSI ์ฐธ์กฐ ๋ชจ๋ธ(OSI 7 Layer)๊ณผ TCP/IP
6 PSTN, PSDN
7 ์ •๋ณดํ†ต์‹ ๋ง ๊ฐœ์š”
8 ์ „์†ก๋งค์ฒด
9 ํ†ต์‹ ์˜ ๊ธฐ์ดˆ
10 ์•„๋‚ ๋กœ๊ทธ/๋””์ง€ํ„ธ ์‹ ํ˜ธ
11 ์ •๋ณดํ†ต์‹  ์กฐ์ง
12 NFV
13 SDN
14 TCP์™€ UDP
15 DNS
16 Security
17 ARP Spoofing