
[RAG] Building a Web Content (Image + Text) Extraction Pipeline with Crawl4AI + LLM (feat. Gemini)

moonzoo 2025. 11. 12. 13:42

https://github.com/unclecode/crawl4ai

 


 

If you have implemented crawling in Python, you have probably built a scraper dedicated to one specific site with Selenium or BeautifulSoup. That approach has clear limitations.

  1. Structural dependency: The moment the crawl target changes from site A to site B, all of the code becomes useless. Just because div.content-area became article#main, you have to write a new scraper every time.
  2. Missing image content: The bigger problem is when the information lives in images rather than text. A chart, banner, or diagram with an empty alt attribute is nothing more than a 'file path', so the key information inside it is lost.

To overcome both limitations (structural dependency and missing image content) at once, I built an AI crawling pipeline that does not depend on the HTML structure of any particular site.

The pipeline is designed to work in four steps:

1. collect every URL on the site with crawl4ai's deep crawling,

2. analyze every image on each page with the Gemini Vision model,

3. combine the original text and the generated image descriptions into 'a single HTML' structure, and

4. hand this merged document back to the Gemini LLM so that only the "core body content" is kept.

In this post I share in detail how I automated this process with crawl4ai, Gemini, and BeautifulSoup, and why I chose each technology. The example page to crawl will be my own GitHub page.

https://github.com/moonjoo98?tab=repositories

 


 

The core technologies used are as follows.

  • Crawling engine: crawl4ai (AsyncWebCrawler, LLMContentFilter)
  • AI model: Google Gemini 2.5 Flash (VLM and LLM)
  • HTML processing: BeautifulSoup
  • Asynchronous processing: asyncio

Process 1: Site Mapping (Collecting All Valid URLs)

First, we need to know how many targets there are to process. I used crawl4ai's deep crawling to scan the whole site and collect the URL of every reachable page.

Technology used: crawl4ai's BFSDeepCrawlStrategy (breadth-first search)

  1. Systematic exploration (BFS > DFS): When the goal is to find every link without gaps, BFS (breadth-first) is the most systematic and stable approach. It finds all links at depth 1, then all links at depth 2, and so on. DFS (depth-first), in contrast, risks going too deep down one path (for example, an infinite calendar page) and missing other important sections.
  2. Clear boundaries (include_external=False): My target is content inside github.com. The include_external=False option keeps the crawler from wandering off to external social media, blog, or ad links and wasting resources.
  3. Separation of concerns: "URL collection" and "content processing" are entirely different jobs. Keeping them separate means that content already processed stays safe even if URL collection fails, and content processing can be retried later on its own, which makes the pipeline stable and efficient.
    • That said, deep crawling and text collection can also be done at the same time through CrawlerRunConfig. That is much more efficient and faster, but I needed both the text written inside images and a description of each image, so I split the work.
    • As far as I know, crawl4ai can also call something like Tesseract OCR, but Tesseract's Korean accuracy seemed weak to me and it cannot describe an image, which is another reason I split the work.

# Step 1: example URL collection configuration
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

deep_crawl_config = BFSDeepCrawlStrategy(
    max_depth=5,             # a depth suited to the site's structure
    include_external=False,  # stay focused on our own domain
    max_pages=500            # safety cap to limit load on the server
)
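
To make the three parameters concrete, here is a minimal, library-free sketch of a bounded breadth-first traversal over an in-memory link graph. It is not crawl4ai's implementation: the `LINKS` graph, the `bfs_collect` function, and the example.com URLs are hypothetical and only illustrate how a depth limit, a same-domain rule, and a page cap each bound a crawl.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages: URL -> links found on that page.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/a/1", "https://example.com/"],
    "https://example.com/b": ["https://example.com/b/1"],
    "https://example.com/a/1": ["https://example.com/a/1/deep"],
    "https://example.com/b/1": [],
    "https://example.com/a/1/deep": [],
}


def bfs_collect(start_url, max_depth, max_pages, include_external=False):
    """Visit pages level by level and return them in visiting order."""
    start_domain = urlparse(start_url).netloc
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links found at the depth limit
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if not include_external and urlparse(link).netloc != start_domain:
                continue  # stay inside the starting domain
            if len(visited) >= max_pages:
                return visited  # safety cap on the total number of pages
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    for page in bfs_collect("https://example.com/", max_depth=2, max_pages=500):
        print(page)
```

Because the queue is processed first-in, first-out, every depth-1 page is visited before any depth-2 page, which is the property the first point above relies on.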

 

As shown below, the crawler starts from https://github.com/moonjoo98?tab=repositories, follows every link down to max_depth = 5, and collects only URLs in the github.com domain.

 

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def main():
    # Start URL for the crawl
    start_url = "https://github.com/moonjoo98?tab=repositories"

    # Deep crawling strategy
    deep_crawl_config = BFSDeepCrawlStrategy(
        max_depth=5,
        # include_external=False: collect only links inside the same domain
        include_external=False,
        max_pages=500
    )

    # Run configuration for the whole crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=deep_crawl_config,
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True  # print crawl progress to the console
    )

    # Set that stores the unique links collected
    collected_links = set()

    print(f"Starting crawl. Target: {start_url}")

    async with AsyncWebCrawler() as crawler:
        # With a deep crawl strategy, arun() returns one result per visited page
        results = await crawler.arun(start_url, config=config)

        for result in results:
            if result.url:
                collected_links.add(result.url)

    print("\n--- Crawl finished ---")
    print(f"Collected {len(collected_links)} unique links.")

    # Return the collected links as a sorted list
    return sorted(collected_links)


# --- Main entry point ---
if __name__ == "__main__":
    # Run main() and receive the URL list
    url_list = asyncio.run(main())

    # Save the URL list to a file
    output_filename = "collected_urls_test.txt"
    try:
        with open(output_filename, "w", encoding="utf-8") as f:
            for url in url_list:
                f.write(url + "\n")  # one URL per line

        print(f"Saved {len(url_list)} URLs to '{output_filename}'.")

    except OSError as e:
        print(f"An error occurred while saving the file: {e}")
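
Since URL collection and content processing are separate stages, the second stage can start from the saved file and skip whatever has already been handled. The following is a minimal sketch of that idea, not code from the original pipeline: `load_urls`, `pending_urls`, and the `done_urls.txt` log file are hypothetical names introduced for illustration.

```python
from pathlib import Path


def load_urls(path):
    """Read one URL per line, skipping blank lines and duplicates, keeping file order."""
    file = Path(path)
    if not file.exists():
        return []
    urls = []
    seen = set()
    for line in file.read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def pending_urls(all_urls, done_urls):
    """Return the URLs that still need content processing, preserving order."""
    done = set(done_urls)
    return [url for url in all_urls if url not in done]


if __name__ == "__main__":
    collected = load_urls("collected_urls_test.txt")  # file written by the script above
    finished = load_urls("done_urls.txt")             # hypothetical log of processed URLs
    print(f"{len(pending_urls(collected, finished))} URLs left to process")
```

With a log like this, a failed content-processing run can be resumed without repeating the deep crawl.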

 


Process 2: Fetching Dynamic HTML Content

Now the collected URLs are processed one by one. If you try it yourself, though, you will find that requests.get() alone often cannot retrieve the full HTML structure you want.

  • Technology used: crawl4ai's AsyncWebCrawler + BrowserConfig
  1. Handling JavaScript rendering: Some sites generate their content dynamically with JavaScript based on URL parameters, as in EgovPageLink.do?link=.... Simple libraries such as requests or httpx only fetch an empty HTML shell. BrowserConfig(headless=True) tells crawl4ai to run a real browser (Playwright) in the background. That browser executes all of the JavaScript and hands us the final rendered result (HTML) that the user actually sees.
  2. Convenience of abstraction: Using Selenium or Playwright directly makes the code much more complex. crawl4ai abstracts this browser control into a single call, crawler.arun(url).
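
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig


async def process_url_structured(url: str):
    # Excerpt: only the page-fetching part of the function is shown here.
    browser_config = BrowserConfig(headless=True, verbose=False)  # verbose=False is recommended inside a loop
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        delay_before_return_html=2  # extra delay (seconds) so the HTML can finish rendering
    )

    final_result = {"url": url, "combined_markdown": None}

    try:
        print(f"  [Start] Crawling: {url}")
        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(url, config=crawl_config)
            # Image analysis and filtering continue here (full function in Process 3).
    except Exception as e:
        print(f"  [Error] Exception while processing {url}: {e}")
        return None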

Process 3: Semantic Image Analysis (VLM)

Once the HTML is in hand, 'text' and 'images' are separated. An image is not just a file; it is important content in its own right.

  • Technology used: BeautifulSoup + asyncio.gather + Gemini-2.5-Flash (Vision)
  1. Unreliable alt attributes: The alt attribute of an <img> tag is usually empty or meaningless, like "main banner" or "icon". We need to know what the image actually contains. Gemini looks at the image directly and can transcribe the text inside it or generate a description of a table, diagram, product image, and so on.
  2. Parallel processing: If a page has 20 images, processing them sequentially can take more than a minute. asyncio.gather(*image_tasks) sends the 20 image analysis requests to the Gemini server concurrently and returns the results in the same order as the tasks, regardless of which response arrives first. This is the key technique for speeding up the whole process.
  3. Cost and noise optimization (MIN_IMAGE_SIZE): Web pages contain many 1x1-pixel tracking images and small icons. By adding a filter such as if img.width < 100:, I blocked VLM API calls for small images that are very likely meaningless.
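
The ordering behaviour of `asyncio.gather` matters when results are later matched back to their image tags, so it is worth checking in isolation. The sketch below makes no real Gemini calls: `fake_describe` is a hypothetical stand-in that just sleeps, and the `max_concurrency` semaphore is an optional addition (an assumption, not part of the original pipeline) that could help if an API rate limit becomes a problem.

```python
import asyncio


async def fake_describe(image_url, delay, semaphore):
    """Stand-in for a VLM call: sleeps for `delay` seconds, then returns a description."""
    async with semaphore:  # cap the number of requests in flight
        await asyncio.sleep(delay)
        return f"description of {image_url}"


async def describe_all(jobs, max_concurrency=5):
    """Run all jobs concurrently; results come back in the same order as `jobs`."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fake_describe(url, delay, semaphore) for url, delay in jobs]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    # The first image is the slowest, yet its result still comes first in the list.
    jobs = [("a.png", 0.03), ("b.png", 0.01), ("c.png", 0.02)]
    print(asyncio.run(describe_all(jobs)))
```

Because the result list follows task order rather than completion order, zipping it with the list of image tags pairs each description with the right tag.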

import asyncio
from io import BytesIO
from urllib.parse import urljoin

import google.generativeai as genai
import requests
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig, LLMConfig
from crawl4ai.content_filter_strategy import LLMContentFilter
from PIL import Image

# API_KEY, instruction_for_images, and instruction_to_keep_all are defined elsewhere
# (they are not shown in this post); genai.configure(api_key=API_KEY) is assumed
# to have been called already.
MIN_IMAGE_SIZE = 100  # pixels; assumed value, taken from the example filter above
ERROR_PREFIX = "LLM/image download/open failed"  # marks a failed image analysis


async def get_image_description_from_llm(image_url: str):
    if not image_url or image_url.startswith("data:"):
        return ""  # data URI or empty image: treated as filtered out

    model = genai.GenerativeModel('gemini-2.5-flash')

    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        # requests is blocking, so run it in a thread to keep the event loop free
        response_img = await asyncio.to_thread(
            requests.get, image_url, headers=headers, timeout=30
        )
        response_img.raise_for_status()
        img = Image.open(BytesIO(response_img.content))

        if img.width < MIN_IMAGE_SIZE or img.height < MIN_IMAGE_SIZE:
            return ""  # filtered out (empty string)

        response_llm = await model.generate_content_async([instruction_for_images, img])

        if not response_llm.parts:
            return ""

        return response_llm.text
    except Exception as e:
        return f"{ERROR_PREFIX}: {e}"


async def process_url_structured(url: str):
    """
    Take a single URL and return the combined text/image Markdown result
    as a Python dictionary.
    """

    browser_config = BrowserConfig(headless=True, verbose=False)  # verbose=False is recommended inside a loop
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        delay_before_return_html=2
    )

    final_result = {"url": url, "combined_markdown": None}

    try:
        print(f"  [Start] Crawling: {url}")
        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(url, config=crawl_config)

            if not result.success or not result.cleaned_html:
                print(f"  [Failed] Could not fetch HTML: {url}")
                return None  # return None on failure

            html_content = result.cleaned_html
            soup = BeautifulSoup(html_content, 'html.parser')

            # --- Step 1: find image tags and prepare parallel processing ---
            print("  [Progress] LLM Vision processing (images)...")
            image_tags = soup.find_all('img')

            processed_urls = set()
            image_tasks = []
            url_to_tag_map = {}

            for img_tag in image_tags:
                img_src = img_tag.get('src')
                if not img_src:
                    img_tag.decompose()  # remove tags without src from the HTML
                    continue

                absolute_img_url = urljoin(url, img_src)

                if absolute_img_url in processed_urls:
                    continue  # duplicate URL

                processed_urls.add(absolute_img_url)

                url_to_tag_map[absolute_img_url] = img_tag
                image_tasks.append(get_image_description_from_llm(absolute_img_url))

            # --- Step 2: process all images with the LLM in parallel ---
            if image_tasks:
                print(f"  [Progress] Processing {len(image_tasks)} unique images in parallel...")
                # gather() returns results in task order, matching url_to_tag_map
                image_results = await asyncio.gather(*image_tasks)
                print("  [Done] Image LLM processing finished.")
            else:
                image_results = []

            # --- Step 3: add image descriptions to the HTML ---
            for (img_url, img_tag), description in zip(url_to_tag_map.items(), image_results):
                if description and not description.startswith(ERROR_PREFIX):
                    # [Success] insert the LLM result in place of the <img> tag
                    placeholder_text = f"\n\n--- [Image: {img_url}] ---\n{description}\n--- [End of image] ---\n\n"
                    img_tag.replace_with(BeautifulSoup(placeholder_text, 'html.parser'))
                else:
                    # [Failed or filtered out] simply delete the <img> tag from the HTML
                    img_tag.decompose()

            # Handle <img> tags that reused an already-seen URL (not the first occurrence)
            for img_tag in soup.find_all('img'):
                img_tag.decompose()  # remove every remaining <img> tag

            modified_html = str(soup)

            # --- Step 4: final text filtering ---
            print("  [Progress] LLM filtering (merged text + image descriptions)...")
            content_filter = LLMContentFilter(
                llm_config=LLMConfig(
                    provider="gemini/gemini-2.5-flash",
                    api_token=API_KEY
                ),
                instruction=instruction_to_keep_all,  # prompt describing which parts of the HTML to extract
                verbose=False  # verbose=False is recommended inside a loop
            )

            filtered_content_list = content_filter.filter_content(modified_html)
            final_result["combined_markdown"] = "\n".join(filtered_content_list)

            text_snippet = final_result["combined_markdown"][:100].replace("\n", " ")
            print(f"  [Result] Text (first 100 chars): {text_snippet}...")

            return final_result  # return the result dictionary on success

    except Exception as e:
        print(f"  [Error] Exception while processing {url}: {e}")
        return None  # return None on failure
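
The function above handles a single URL, and the post does not show the loop that drives it. One possible driver is sketched below under stated assumptions: `run_pipeline`, `fake_worker`, and the hash-based file names are hypothetical. The per-URL coroutine is passed in as a parameter, so `process_url_structured` could be supplied in place of the fake worker.

```python
import asyncio
import hashlib
import json
import tempfile
from pathlib import Path


async def run_pipeline(urls, worker, out_dir):
    """Process URLs one by one and save each successful result as its own JSON file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []

    for url in urls:
        result = await worker(url)  # expected to return a dict, or None on failure
        if result is None:
            failed.append(url)
            continue
        # A hash of the URL gives a stable, filesystem-safe file name.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16] + ".json"
        target = out_path / name
        target.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
        saved.append(target)
    return saved, failed


async def fake_worker(url):
    """Hypothetical stand-in for the real per-URL function; fails for URLs containing 'bad'."""
    if "bad" in url:
        return None
    return {"url": url, "combined_markdown": f"# Content of {url}"}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:  # temporary folder just for this demo
        saved, failed = asyncio.run(
            run_pipeline(["https://example.com/a", "https://example.com/bad"], fake_worker, out_dir)
        )
        print(f"saved {len(saved)} file(s), {len(failed)} failure(s)")
```

Writing one file per URL means a failure on one page never corrupts the output of another, and the `failed` list can be fed straight back in for a retry.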

Process 4: Combining Text and Image Content

At this point we hold the 'original text' and the 'image description text' separately. My goal, however, is to convert the page into a single Markdown structure while preserving the page's structure as much as possible. So I replace each image tag with its OCR result or description, which keeps the text arranged as closely as possible to the order in which it appears on the actual web page.

Because I wanted to merge text and image content this way and refine it into the final output, I separated URL collection from the scraping work.

  • Technology used: BeautifulSoup's img_tag.replace_with()

Instead of saving the text and the image analysis results as separate files, I replaced each <img> tag in the original HTML with the "image description text" produced by the VLM.

 

Here is an example: first the original HTML, then the result after replacement.

<p>This is the procedure guide.</p>
<img src="chart.jpg">
<p>These are the precautions.</p>

<p>This is the procedure guide.</p>
"--- [Image: .../chart.jpg] ---
A table of symptoms (Symptom) and treatments (Treatment).
Nasal congestion: medication
--- [End of image] ---"
<p>These are the precautions.</p>

Reason: This merges two kinds of data, 'text' and 'images', into "a single plain-text document". The LLM in the next step can then read the merged document and follow the context ("ah, this image came right after this text"). In actual RAG retrieval, too, the relevant images are included in what gets searched, which helps improve performance.


Process 5: Final Content Refinement (LLM)

We now have HTML in which text and image descriptions are merged, but which is still full of noise such as "header", "footer", "menu", and "ads".

  • Technology used: crawl4ai's LLMContentFilter + a custom prompt
  1. Limits of soup.get_text(): Simply calling get_text() produces data mixed with all kinds of UI text such as "Copyright", "Log in", "Back to top", and "Quick consultation".
  2. The prompt is the logic: I gave the AI explicit rules. (instruction_to_keep_all is the prompt.)
    • Rule 1 (link cleanup): Convert a Markdown link such as [Daerim branch](/dl/index.do) into the plain text "Daerim branch". The []() syntax and the URL are removed so that only the meaning remains.
    • Rule 2 (noise removal): Delete every UI element listed under [Content to exclude], such as "common header/footer", "business information", and "Copyright".
    • Rule 3 (body preservation): Preserve the "core body" and "image descriptions" listed under [Content to include].
    • Rule 4 (Markdown conversion): There are research results suggesting that LLMs understand context better in Markdown format. Converting tables to Markdown can also improve an LLM's understanding of them. So I instruct the model to produce the final result with the HTML structure converted to Markdown.

LLMContentFilter takes this instruction and the noisy modified_html as input, and outputs only the clean core content we want (combined_markdown) in Markdown format.
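
The actual wording of the prompt is not shown in the post. The sketch below is a hypothetical prompt that mirrors the four rules, together with a small deterministic helper that could serve as a safety net for rule 1 in case the model leaves some links untouched. `INSTRUCTION_SKETCH` and `strip_markdown_links` are names introduced here for illustration and are not part of crawl4ai.

```python
import re

# Hypothetical prompt mirroring the four rules above; the wording used in the
# original pipeline is not shown in the post.
INSTRUCTION_SKETCH = """
Extract the main content of the page and return it as Markdown.

[Rules]
1. Convert Markdown links such as [text](url) to plain text and drop the URL.
2. Remove everything listed under [Content to exclude].
3. Keep everything listed under [Content to include].
4. Output Markdown, converting HTML tables to Markdown tables.

[Content to include]
- Core body text
- Image descriptions placed between the image markers

[Content to exclude]
- Common headers, footers, and menus
- Business information, copyright notices, advertisements
"""

# Matches [label](target) and ![label](target); the label is captured.
_MD_LINK = re.compile(r"!?\[([^\]]*)\]\([^)]*\)")


def strip_markdown_links(markdown: str) -> str:
    """Deterministic fallback for rule 1: replace each Markdown link with its label."""
    return _MD_LINK.sub(r"\1", markdown)


if __name__ == "__main__":
    print(strip_markdown_links("Visit the [Daerim branch](/dl/index.do) today."))
```

Running a cheap regex pass after the LLM keeps rule 1 enforceable even when the model's output is inconsistent from page to page.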


Conclusion

With crawl4ai and Gemini, this pipeline overcomes two limitations: Selenium code being tied to a specific HTML structure, and image-centric content being missed.

I also tried to capture my reasoning at each step: why a browser instead of requests, why a VLM instead of alt attributes, why parallel instead of sequential processing, why an LLM filter instead of get_text(), and why individual files instead of a single file.

Of course, because refinement and image interpretation are delegated to an LLM, hallucinations can occur. I recommend implementing your own post-processing to handle such cases.
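
As one possible starting point (an assumption, not something the original pipeline does), a crude grounding check can flag output lines whose words barely appear in the source text. `ungrounded_lines` and the 0.5 threshold are hypothetical. Word-level matching is rough, especially for Korean, so this is only a first filter for deciding which lines deserve a manual look.

```python
import re

_WORD = re.compile(r"\w+")


def ungrounded_lines(output_markdown, source_text, min_overlap=0.5):
    """Return output lines where fewer than `min_overlap` of the words occur in the source."""
    source_words = {w.lower() for w in _WORD.findall(source_text)}
    flagged = []
    for line in output_markdown.splitlines():
        words = [w.lower() for w in _WORD.findall(line)]
        if not words:
            continue  # skip blank lines and pure punctuation such as table rules
        overlap = sum(w in source_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(line)
    return flagged


if __name__ == "__main__":
    source = "Nasal congestion is treated with medication. Follow the precautions."
    output = (
        "## Precautions\n"
        "Nasal congestion is treated with medication.\n"
        "Surgery is always required within two days."
    )
    print(ungrounded_lines(output, source))
```

For image descriptions there is no source text to compare against, so those sections would need to be excluded from the check or verified some other way.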

 

Also, if you want to leave out image descriptions and collect just OCR plus all of the plain text, crawl4ai's built-in features alone are enough to do so. Please refer to the official documentation and examples below and use whatever fits your situation.

Thank you.