Master Web Scraping with Llama3, Ollama, and ScrapeGraphAI: Your Complete Guide to a Local and Free Solution

In a constantly evolving web landscape, ScrapeGraphAI introduces a new era of web scraping. This open-source library leverages Large Language Models (LLMs) to offer flexible and low-maintenance scraping solutions for developers.

ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.). The SmartScraperGraph class is one of the default scraping pipelines: it uses a direct graph implementation in which each node has its own function, from retrieving the HTML of a website to extracting the relevant information based on your query and generating a coherent answer. The video below walks through how to scrape website content using Llama3, Ollama, and ScrapeGraphAI, all running locally. Note that a minimum of 15GB of RAM is required for this setup. The same approach also works on local documents such as XML, HTML, JSON, and more. Let's dive into it!


Video:


ScrapeGraphAI Architecture

Available Scraping Pipelines
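
Besides SmartScraperGraph, ScrapeGraphAI ships other pipeline classes. As a quick illustration (a sketch, assuming your installed version includes it), the SearchGraph pipeline runs the same LLM-driven extraction over web search results instead of a single URL:

# Sketch: SearchGraph takes a prompt and a config, but no source URL
from scrapegraphai.graphs import SearchGraph

search_graph = SearchGraph(
    prompt="List me the top web scraping libraries",  # hypothetical prompt
    config=graph_config,  # reuse the Ollama configuration shown below
)
print(search_graph.run())
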
Code

Install these requirements in your terminal (for example with pip):

pip install playwright scrapegraphai==0.9.0b7 nest_asyncio
playwright install  # downloads the browsers Playwright needs
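
The configuration below points at a local Ollama server, so make sure Ollama is installed, running, and has the two referenced models pulled. A quick sketch, assuming the default Ollama setup on port 11434:

ollama pull llama3            # the LLM used for extraction
ollama pull nomic-embed-text  # the embedding model used for retrieval

If the server is not already running, start it with "ollama serve".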


Create a file called app.py and paste this code.

# Import SmartScraperGraph from scrapegraphai.graphs module
from scrapegraphai.graphs import SmartScraperGraph
import nest_asyncio  # Import nest_asyncio module for asynchronous operations
nest_asyncio.apply()  # Apply nest_asyncio to resolve any issues with asyncio event loop

# Configuration dictionary for the graph
graph_config = {
    "llm": {
        "model": "ollama/llama3",  # Specify the model for the llm
        "temperature": 0,  # Set temperature parameter for llm
        "format": "json",  # Specify the output format as JSON for Ollama
        "base_url": "http://localhost:11434",  # Set the base URL for Ollama
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",  # Specify the model for embeddings
        "base_url": "http://localhost:11434",  # Set the base URL for Ollama
    },
    "verbose": True,  # Enable verbose mode for debugging purposes
}

# Initialize SmartScraperGraph with prompt, source, and configuration
smart_scraper_graph = SmartScraperGraph(
    #prompt="List all the content",  # Set prompt for scraping
    prompt="List me all the projects with their descriptions",
    # Source URL or HTML content to scrape
    #source="https://github.com/InsightEdge01",
    source="https://perinim.github.io/projects",
    config=graph_config  # Pass the graph configuration
)

# Run the SmartScraperGraph and store the result
result = smart_scraper_graph.run()

# Print the result
print(result)

# Prettify the result and display the JSON
import json

output = json.dumps(result, indent=2)  # Convert result to JSON format with indentation

line_list = output.split("\n")  # Split the JSON string into lines

# Print each line of the JSON separately
for line in line_list:
    print(line)
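
Optionally, you can keep the scraped data instead of only printing it. A small addition (not part of the original script) that writes the prettified JSON string to a file with a hypothetical name:

# Optional: persist the prettified JSON (hypothetical filename)
with open("result.json", "w") as f:
    f.write(output)
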
In your terminal, run "python app.py" to execute the script.