In the realm of web scraping, tools that combine power, flexibility, and user-friendliness are rare. Enter Scraperr—an open-source, self-hosted web scraping solution that empowers users to extract data from websites without writing a single line of code.
🧰 What is Scraperr?
Scraperr is a self-hosted web application designed to simplify the process of web scraping. It allows users to scrape websites by specifying elements via XPath, manage multiple scraping jobs, and export results in various formats, all through an intuitive interface.
✨ Key Features
- XPath-Based Extraction: Precisely target page elements using XPath selectors.
- Queue Management: Submit and manage multiple scraping jobs efficiently.
- Domain Spidering: Option to scrape all pages within the same domain.
- Custom Headers: Add JSON headers to your scraping requests.
- Media Downloads: Automatically download images, videos, and other media.
- Results Visualization: View scraped data in a structured table format.
- Data Export: Export your results in Markdown and CSV formats.
- Notification Channels: Receive completion notifications through various channels.
🚀 Getting Started with Scraperr
Prerequisites
- Docker and Docker Compose installed on your system.
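To confirm both are available, you can check their versions from a terminal:

```bash
docker --version
docker compose version   # use `docker-compose --version` if you run the standalone v1 binary
```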
Installation Steps
- Clone the Repository:

```bash
git clone https://github.com/jaypyles/Scraperr.git
cd Scraperr
```

- Set Up Environment Variables:

Edit the docker-compose.yml file to configure environment variables. Default values are provided, but you may customize them as needed.

```yaml
scraperr:
  environment:
    - NEXT_PUBLIC_API_URL=http://scraperr_api:8000
    - SERVER_URL=http://scraperr_api:8000
scraperr_api:
  environment:
    - SECRET_KEY=your_secret_key
    - ALGORITHM=HS256
    - ACCESS_TOKEN_EXPIRE_MINUTES=600
```

- Start the Application:

Use the provided Makefile to start the containers.

```bash
make up
```

Once the containers are up and running, access the Scraperr interface at http://localhost.
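The SECRET_KEY sits alongside ALGORITHM and ACCESS_TOKEN_EXPIRE_MINUTES and is used to sign access tokens, so replace the placeholder with a random value. One simple way to generate one (an illustrative command, not a Scraperr requirement) is:

```bash
# Print a 64-character hex string suitable for use as SECRET_KEY
openssl rand -hex 32
```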
🕹️ Submitting a Scraping Job
- Enter the URL: Input the website URL you wish to scrape.
- Define Selectors: Specify the data elements to extract using XPath selectors (a few illustrative examples follow this list).
- Submit the Job: Add the job to the queue.
- Monitor Progress: Track the job status in the Job Table.
- Download Results: Once completed, download the data in your preferred format.
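To make the selector step concrete, here are a few generic XPath expressions of the kind you might enter. The tag names and class names are purely illustrative and will differ on the site you scrape:

- `//h1/text()` extracts the text of every top-level heading.
- `//a[contains(@class, "headline")]/@href` collects the link targets of anchors whose class contains "headline".
- `//img[@class="product-image"]/@src` gathers the URLs of matching images.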
🤖 AI Integration
Scraperr offers AI integration, allowing you to interact with your scraped data using natural language queries. By integrating with OpenAI's GPT models or Ollama, you can ask questions about your scraping jobs directly within the interface.
Setup:
- OpenAI:

```yaml
scraperr_api:
  environment:
    OPENAI_KEY: your_openai_api_key
    OPENAI_MODEL: gpt-4o
```

- Ollama:

```yaml
scraperr_api:
  environment:
    OLLAMA_URL: http://ollama:11434
    OLLAMA_MODEL: phi3:latest
```
This feature enhances data analysis by providing insights and answers based on your scraped content.
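The OLLAMA_URL above assumes an Ollama instance is reachable from the API container. If you do not already run one, a minimal sketch of adding it to the same docker-compose.yml could look like this (the service definition and bind-mount path are assumptions, not part of Scraperr's documented setup):

```yaml
ollama:
  image: ollama/ollama:latest
  ports:
    - "11434:11434"
  volumes:
    - ./ollama:/root/.ollama  # persist pulled models between restarts
```

Once the container is running, pull the model referenced by OLLAMA_MODEL, for example with `docker compose exec ollama ollama pull phi3:latest`.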
📊 Managing Jobs with the Job Table
The Job Table provides a comprehensive view of all your scraping jobs.
- Filtering: Search and filter jobs by ID, URL, or status.
- Actions: Delete, download, or favorite jobs for easy access.
- AI Chat: Engage with your data through the integrated AI chat feature.
๐งช Advanced Options
For users seeking more control, Scraperr offers advanced configurations:
- Multi-Page Scraping: Navigate and scrape paginated content.
- Custom Headers: Simulate different browsers or sessions (an example follows this list).
- Proxy Settings: Route requests through proxies for anonymity or access.
These options can be configured in the docker-compose.yml file or through the interface, providing flexibility for complex scraping tasks.
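Since custom headers are supplied as JSON, a request that mimics a desktop browser session might include something like the following (the header values are illustrative, not required by Scraperr):

```json
{
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
  "Accept-Language": "en-US,en;q=0.9",
  "Referer": "https://www.example.com/"
}
```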
🛠️ Deployment Options
Scraperr supports various deployment methods to suit different environments:
- Docker: Quick and easy setup using Docker Compose.
- Helm: Deploy Scraperr on Kubernetes clusters using Helm charts.
Detailed instructions for each deployment method are available in the Scraperr Docs.
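As a rough illustration of the Helm route (the chart path, release name, and namespace below are assumptions; check the Scraperr Docs for the chart's actual location and values), an install from a chart directory in the cloned repository would follow the usual pattern:

```bash
helm install scraperr ./helm --namespace scraperr --create-namespace
```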
🏁 Conclusion
Scraperr stands out as a powerful, self-hosted web scraping solution that combines ease of use with advanced features. Whether you're a developer, data analyst, or enthusiast, Scraperr provides the tools you need to extract and analyze web data efficiently.
Explore the project on GitHub and dive into the official documentation to get started.
Happy Scraping!