End-to-end automated pipeline for tracking historical content drift in external documentation.
Tools & Automation Engineer
Tracking content changes in non-standardized HTML (like Wikipedia) creates a "Signal-to-Noise" problem. Naive diffing triggers false positives due to dynamic elements like ads, navigation bars, or timestamps. The challenge was to architect a pipeline that isolates semantic content (cleaning the DOM) and automates the entire lifecycle—from scraping to archival to diff reporting—without human intervention.
Automated Content Drift Pipeline
```python
def clean_soup(soup):
    """
    Semantic Filtering:
    We decompose purely visual or dynamic elements to ensure
    the diff engine only compares meaningful content.
    """
    content_div = soup.find('div', {'class': 'mw-parser-output'})
    if not content_div:
        return None
    # Noise Reduction: Remove dynamic/irrelevant tags
    for tag in content_div.find_all(['table', 'script', 'style']):
        tag.decompose()
    # Advanced filtering for Wikipedia-specific artifacts,
    # which are marked by CSS class rather than tag name (e.g. navbox)
    classes_to_remove = ['infobox', 'navbox', 'mw-editsection', 'reflist', 'toc']
    for cls in classes_to_remove:
        for tag in content_div.find_all(class_=cls):
            tag.decompose()
    return content_div
```

I implemented a "Snapshot & Compare" strategy using `tar.gz` archives rather than a database. While a database offers faster queries, file-based archiving provides immutable "Source of Truth" snapshots that are easier to debug and transfer. For the comparison logic (`diffcheck`), I normalize the content to Markdown first (stripping HTML noise) so the diff focuses on text changes rather than markup changes.
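The archive-and-diff flow can be sketched as follows. This is a minimal illustration, not the production code: the `snapshot`/`diffcheck` names, the fixed `snapshot/` archive prefix, and the use of `difflib` for the comparison are all assumptions.

```python
import difflib
import tarfile
from pathlib import Path

def snapshot(day_dir: str, archive_path: str) -> None:
    """Freeze a day's scraped Markdown files into an immutable tar.gz snapshot."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # Fixed arcname so member paths line up across daily archives
        tar.add(day_dir, arcname="snapshot")

def diffcheck(old_archive: str, new_archive: str, member: str) -> str:
    """Unified diff of one Markdown file between two daily snapshots."""
    def read(archive: str) -> list:
        with tarfile.open(archive, "r:gz") as tar:
            return tar.extractfile(member).read().decode("utf-8").splitlines(keepends=True)
    return "".join(difflib.unified_diff(
        read(old_archive), read(new_archive),
        fromfile=Path(old_archive).name, tofile=Path(new_archive).name))
```

Because each day's content is frozen into its own archive, any historical diff can be reproduced exactly by pairing two snapshots.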
The pipeline degrades gracefully: network failures (HTTP 404/500) during ingestion are logged to `stderr` but do not halt the batch process. The `diffcheck` utility handles missing daily archives by emitting clear diagnostic messages rather than crashing.
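A minimal sketch of that failure policy. The `ingest_batch` helper and its injected `fetch` callable are hypothetical, chosen here so failures can be simulated without a network:

```python
import sys
from typing import Callable, Dict, List

def ingest_batch(urls: List[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch every URL; log failures to stderr instead of halting the batch."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # covers HTTP errors (404/500), timeouts, DNS
            print(f"[ingest] skipping {url}: {exc}", file=sys.stderr)
    return results
```

Injecting `fetch` also keeps the batch loop unit-testable: a stub that raises for selected URLs verifies the skip-and-log behavior.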
Automated the tracking of regulatory documentation changes. By filtering out 90% of HTML noise (ads/navs), the system reduced false-positive alerts to near zero. The integration with `cron` provides 24/7 monitoring, producing daily diff reports that highlight only semantic modifications.
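For illustration, the `cron` schedule might look like the following. All paths, script names, and the `--yesterday` flag are placeholders, not the deployed configuration:

```shell
# Hypothetical crontab entries for the drift pipeline
# 02:00 — scrape, clean, and archive today's snapshot
0 2 * * * /usr/bin/python3 /opt/drift/scrape.py >> /var/log/drift/scrape.log 2>> /var/log/drift/scrape.err
# 03:00 — diff today's archive against yesterday's and emit the daily report
0 3 * * * /usr/bin/python3 /opt/drift/diffcheck.py --yesterday >> /var/log/drift/report.log 2>&1
```

Redirecting `stderr` to a separate log keeps ingestion failures visible without polluting the diff reports.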