In the realm of web scraping, tools that combine power, flexibility, and user-friendliness are rare. Enter Scraperr—an open-source, self-hosted web scraping solution that empowers users to extract data from websites without writing a single line of code.
🧰 What is Scraperr?
Scraperr is a self-hosted web application designed to simplify the process of web scraping. It allows users to scrape websites by specifying elements via XPath, manage multiple scraping jobs, and export results in various formats, all through an intuitive interface.
✨ Key Features
- XPath-Based Extraction: Precisely target page elements using XPath selectors.
- Queue Management: Submit and manage multiple scraping jobs efficiently.
- Domain Spidering: Option to scrape all pages within the same domain.
- Custom Headers: Add JSON headers to your scraping requests.
- Media Downloads: Automatically download images, videos, and other media.
- Results Visualization: View scraped data in a structured table format.
- Data Export: Export your results in Markdown and CSV formats.
- Notification Channels: Receive completion notifications through various channels.
🚀 Getting Started with Scraperr
Prerequisites
- Docker and Docker Compose installed on your system.
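To confirm both are available, you can check their versions from a terminal:

```bash
docker --version
docker compose version   # use `docker-compose --version` if you run the standalone v1 binary
```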
Installation Steps
- Clone the Repository:

```bash
git clone https://github.com/jaypyles/Scraperr.git
cd Scraperr
```

- Set Up Environment Variables:

Edit the docker-compose.yml file to configure environment variables. Default values are provided, but you may customize them as needed.

```yaml
scraperr:
  environment:
    - NEXT_PUBLIC_API_URL=http://scraperr_api:8000
    - SERVER_URL=http://scraperr_api:8000
scraperr_api:
  environment:
    - SECRET_KEY=your_secret_key
    - ALGORITHM=HS256
    - ACCESS_TOKEN_EXPIRE_MINUTES=600
```

- Start the Application:

Use the provided Makefile to start the containers.

```bash
make up
```

Once the containers are up and running, access the Scraperr interface at http://localhost.
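The SECRET_KEY sits alongside ALGORITHM and ACCESS_TOKEN_EXPIRE_MINUTES and is used to sign access tokens, so replace the placeholder with a random value. One simple way to generate one (an illustrative command, not a Scraperr requirement) is:

```bash
# Print a 64-character hex string suitable for use as SECRET_KEY
openssl rand -hex 32
```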
🕹️ Submitting a Scraping Job
- Enter the URL: Input the website URL you wish to scrape.
- Define Selectors: Specify the data elements to extract using XPath selectors (a few illustrative examples follow this list).
- Submit the Job: Add the job to the queue.
- Monitor Progress: Track the job status in the Job Table.
- Download Results: Once completed, download the data in your preferred format.
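To make the selector step concrete, here are a few generic XPath expressions of the kind you might enter. The tag names and class names are purely illustrative and will differ on the site you scrape:

- `//h1/text()` extracts the text of every top-level heading.
- `//a[contains(@class, "headline")]/@href` collects the link targets of anchors whose class contains "headline".
- `//img[@class="product-image"]/@src` gathers the URLs of matching images.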
🤖 AI Integration
Scraperr offers AI integration, allowing you to interact with your scraped data using natural language queries. By integrating with OpenAI's GPT models or Ollama, you can ask questions about your scraping jobs directly within the interface.
Setup:
- OpenAI:

```yaml
scraperr_api:
  environment:
    OPENAI_KEY: your_openai_api_key
    OPENAI_MODEL: gpt-4o
```

- Ollama:

```yaml
scraperr_api:
  environment:
    OLLAMA_URL: http://ollama:11434
    OLLAMA_MODEL: phi3:latest
```
This feature enhances data analysis by providing insights and answers based on your scraped content.
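The OLLAMA_URL above assumes an Ollama instance is reachable from the API container. If you do not already run one, a minimal sketch of adding it to the same docker-compose.yml could look like this (the service definition and bind-mount path are assumptions, not part of Scraperr's documented setup):

```yaml
ollama:
  image: ollama/ollama:latest
  ports:
    - "11434:11434"
  volumes:
    - ./ollama:/root/.ollama  # persist pulled models between restarts
```

Once the container is running, pull the model referenced by OLLAMA_MODEL, for example with `docker compose exec ollama ollama pull phi3:latest`.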
📊 Managing Jobs with the Job Table
The Job Table provides a comprehensive view of all your scraping jobs.
- Filtering: Search and filter jobs by ID, URL, or status.
- Actions: Delete, download, or favorite jobs for easy access.
- AI Chat: Engage with your data through the integrated AI chat feature.
๐งช Advanced Options
For users seeking more control, Scraperr offers advanced configurations:
- Multi-Page Scraping: Navigate and scrape paginated content.
- Custom Headers: Simulate different browsers or sessions (an example follows this list).
- Proxy Settings: Route requests through proxies for anonymity or access.
These options can be configured in the docker-compose.yml file or through the interface, providing flexibility for complex scraping tasks.
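Since custom headers are supplied as JSON, a request that mimics a desktop browser session might include something like the following (the header values are illustrative, not required by Scraperr):

```json
{
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
  "Accept-Language": "en-US,en;q=0.9",
  "Referer": "https://www.example.com/"
}
```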
🛠️ Deployment Options
Scraperr supports various deployment methods to suit different environments:
- Docker: Quick and easy setup using Docker Compose.
- Helm: Deploy Scraperr on Kubernetes clusters using Helm charts.
Detailed instructions for each deployment method are available in the Scraperr Docs.
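As a rough illustration of the Helm route (the chart path, release name, and namespace below are assumptions; check the Scraperr Docs for the chart's actual location and values), an install from a chart directory in the cloned repository would follow the usual pattern:

```bash
helm install scraperr ./helm --namespace scraperr --create-namespace
```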
🏁 Conclusion
Scraperr stands out as a powerful, self-hosted web scraping solution that combines ease of use with advanced features. Whether you're a developer, data analyst, or enthusiast, Scraperr provides the tools you need to extract and analyze web data efficiently.
Explore the project on GitHub and dive into the official documentation to get started.
Happy Scraping!