Web Content Integrity Monitor

End-to-end automated pipeline for tracking historical content drift in external documentation.

Role

Tools & Automation Engineer

Tech Stack
Python · BeautifulSoup · Cron · Diff Algorithms

The Challenge

Tracking content changes in non-standardized HTML (like Wikipedia) creates a "Signal-to-Noise" problem. Naive diffing triggers false positives due to dynamic elements like ads, navigation bars, or timestamps. The challenge was to architect a pipeline that isolates semantic content (cleaning the DOM) and automates the entire lifecycle—from scraping to archival to diff reporting—without human intervention.
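The noise problem is easy to reproduce. A minimal sketch (the page snippets are invented for illustration): two fetches of the same page differ only in a rendering timestamp, yet a naive diff of the raw HTML still flags a change.

```python
import difflib

# Two fetches of the "same" page: only the dynamic timestamp differs.
snapshot_a = '<p>Policy text.</p><span class="ts">Rendered 09:00</span>'
snapshot_b = '<p>Policy text.</p><span class="ts">Rendered 09:05</span>'

diff = list(difflib.unified_diff(
    snapshot_a.splitlines(), snapshot_b.splitlines(), lineterm=""))
print("\n".join(diff))  # Non-empty output: a false positive, the prose is unchanged
```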

Architecture & Deep Dive

System Architecture

Automated Content Drift Pipeline

Cron Scheduler → Ingestion (CSV/Net) → DOM Sanitizer → Markdown Converter → Snapshot Archivist → Diff Engine
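A minimal sketch of how these stages could be wired together (function, path, and file names here are illustrative rather than the project's actual API; `clean_soup` is the sanitizer shown under Key Implementation below):

```python
import csv
import pathlib
import sys

import requests
from bs4 import BeautifulSoup

def run_pipeline(url_csv: str, snapshot_dir: str) -> None:
    out = pathlib.Path(snapshot_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(url_csv, newline="") as fh:                # Ingestion (CSV)
        urls = [row[0] for row in csv.reader(fh) if row]
    for url in urls:
        try:
            html = requests.get(url, timeout=30).text    # Ingestion (Net)
        except requests.RequestException as exc:
            print(f"fetch failed: {url}: {exc}", file=sys.stderr)
            continue                                     # keep the batch alive
        content = clean_soup(BeautifulSoup(html, "html.parser"))  # DOM Sanitizer
        if content is None:
            continue
        # Markdown Converter: plain-text extraction as a stand-in here
        text = content.get_text("\n", strip=True)
        name = url.rstrip("/").rsplit("/", 1)[-1] or "index"
        (out / f"{name}.md").write_text(text)            # Snapshot Archivist
```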

Key Implementation

```python
def clean_soup(soup):
    """
    Semantic Filtering:
    We decompose purely visual or dynamic elements to ensure
    the diff engine only compares meaningful content.
    """
    content_div = soup.find('div', {'class': 'mw-parser-output'})
    if not content_div:
        return None

    # Noise Reduction: remove dynamic/irrelevant tags
    for tag in content_div.find_all(['table', 'script', 'style']):
        tag.decompose()

    # Advanced filtering for Wikipedia-specific artifacts
    # ('navbox' is a CSS class, not an HTML tag, so it is matched by class)
    classes_to_remove = ['infobox', 'mw-editsection', 'reflist', 'toc', 'navbox']
    for cls in classes_to_remove:
        for tag in content_div.find_all(class_=cls):
            tag.decompose()

    return content_div
```
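A note on the API choice: `decompose()` is used rather than `extract()` because the removed tags are never needed again; `extract()` detaches and returns the tag, while `decompose()` destroys it outright, keeping the tree lean for the conversion step.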

Technical Trade-offs

I implemented a "Snapshot & Compare" strategy using `tar.gz` archives rather than a database. While a database offers faster queries, file-based archiving provides immutable "Source of Truth" snapshots that are easier to debug and transfer. For the comparison logic (`diffcheck`), I chose to normalize the content to Markdown first (stripping HTML noise) to focus on text changes rather than markup changes.
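A sketch of that strategy under assumed paths and names: a dated `tar.gz` freezes the day's Markdown snapshots, and the comparison runs on the normalized text rather than raw HTML.

```python
import difflib
import tarfile
from datetime import date
from pathlib import Path

def archive_day(snapshot_dir: str, out_dir: str) -> Path:
    """Freeze today's snapshots into an immutable, dated archive."""
    archive = Path(out_dir) / f"{date.today():%Y-%m-%d}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(snapshot_dir, arcname=date.today().isoformat())
    return archive

def diff_versions(old_md: str, new_md: str) -> str:
    """Compare normalized Markdown so markup churn never reaches the report."""
    return "\n".join(difflib.unified_diff(
        old_md.splitlines(), new_md.splitlines(),
        fromfile="yesterday", tofile="today", lineterm=""))
```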

Reliability & Validation

Error Handling Strategy

The pipeline features Graceful Degradation. Network failures (404/500) during ingestion are logged to `stderr` but do not halt the batch process. The `diffcheck` utility handles missing daily archives by providing clear diagnostic messages rather than crashing.
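A sketch of that diagnostic path, assuming the dated `tar.gz` layout from the archiving step (names are illustrative): a missing archive or member yields a clear message on `stderr` and a `None` return instead of a traceback.

```python
import sys
import tarfile
from pathlib import Path

def read_from_archive(archive_dir: str, day: str, member: str) -> str | None:
    """Return one document from a daily archive, or None with a diagnostic."""
    path = Path(archive_dir) / f"{day}.tar.gz"
    if not path.exists():
        print(f"[diffcheck] missing archive for {day} ({path}); "
              "was the cron run skipped?", file=sys.stderr)
        return None
    with tarfile.open(path, "r:gz") as tar:
        try:
            fh = tar.extractfile(f"{day}/{member}")
        except KeyError:
            print(f"[diffcheck] {member} not found in the {day} archive",
                  file=sys.stderr)
            return None
        return fh.read().decode("utf-8") if fh else None
```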

Impact & Collaboration

Automated the tracking of regulatory documentation changes. By filtering out 90% of HTML noise (ads/navs), the system reduced false positive alerts to near zero. The integration with `cron` ensures 24/7 monitoring reliability, producing daily diff reports that highlight only semantic modifications.
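For reference, a representative crontab entry (paths are hypothetical): run the pipeline nightly and append warnings to a log for review.

```
# m h dom mon dow  command
0 2 * * * /usr/bin/python3 /opt/integrity-monitor/run_pipeline.py 2>> /var/log/integrity-monitor.log
```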