End-to-end automated pipeline for tracking historical content drift in external documentation.
Tracking content changes in non-standardized HTML (like Wikipedia) creates a signal-to-noise problem. Naive diffing triggers false positives because of dynamic elements such as ads, navigation bars, and timestamps. The challenge was to architect a pipeline that isolates semantic content by cleaning the DOM and automates the entire lifecycle, from scraping to archival to diff reporting, without human intervention.
Automated Content Drift Pipeline
def clean_soup(soup):
    """
    Semantic Filtering:
    We decompose purely visual or dynamic elements to ensure
    the diff engine only compares meaningful content.
    """
    content_div = soup.find('div', {'class': 'mw-parser-output'})
    if not content_div:
        return None
    # Noise Reduction: Remove dynamic/irrelevant tags
    for tag in content_div.find_all(['table', 'script', 'style']):
        tag.decompose()
    # Advanced filtering for Wikipedia-specific artifacts.
    # 'navbox' is a CSS class, not a tag name, so it is matched here by class.
    classes_to_remove = ['navbox', 'infobox', 'mw-editsection', 'reflist', 'toc']
    for cls in classes_to_remove:
        for tag in content_div.find_all(class_=cls):
            tag.decompose()
    return content_div

I implemented a "Snapshot & Compare" strategy using tar.gz archives rather than a database. While a database offers faster queries, file-based archiving provides immutable "Source of Truth" snapshots that are easier to debug and transfer. For the comparison logic (diffcheck), I chose to normalize the content to Markdown first (stripping HTML noise) so that diffs reflect text changes rather than markup changes.
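The snapshot-and-compare idea can be illustrated with a minimal sketch built only on the standard library. It assumes each page has already been normalized to Markdown text. The function names (write_snapshot, read_snapshot, diff_snapshots) and the archive layout are hypothetical and stand in for the actual diffcheck code, which is not shown here.

```python
import difflib
import io
import tarfile


def write_snapshot(archive_path, pages):
    """Write a {name: text} mapping into a new gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name, text in sorted(pages.items()):
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # size must be set before adding the member
            tar.addfile(info, io.BytesIO(data))


def read_snapshot(archive_path):
    """Return {name: text} for every regular file stored in the archive."""
    pages = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                pages[member.name] = tar.extractfile(member).read().decode("utf-8")
    return pages


def diff_snapshots(old_path, new_path):
    """Return unified-diff lines for every page whose text differs."""
    old, new = read_snapshot(old_path), read_snapshot(new_path)
    report = []
    for name in sorted(old.keys() | new.keys()):
        # A page missing from one side is compared against empty text.
        before = old.get(name, "").splitlines()
        after = new.get(name, "").splitlines()
        report.extend(
            difflib.unified_diff(
                before, after,
                fromfile=f"old/{name}", tofile=f"new/{name}",
                lineterm="",
            )
        )
    return report
```

Because each archive is written once and never modified, any two daily snapshots can be compared later without re-scraping, which is the debugging advantage described above.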
The pipeline degrades gracefully. Failed requests during ingestion (network errors or HTTP 404/500 responses) are logged to stderr but do not halt the batch process. The diffcheck utility handles missing daily archives by printing clear diagnostic messages rather than crashing.
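One way to get this behaviour during ingestion is sketched below, using only the standard library. The names fetch and ingest_batch, the log format, and the injectable fetcher parameter are assumptions made for illustration. The real pipeline may use a different HTTP client.

```python
import sys
import urllib.request


def fetch(url, timeout=10):
    """Return the decoded body of url; raises on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def ingest_batch(urls, fetcher=fetch):
    """Fetch every URL, logging failures to stderr instead of aborting.

    Returns (pages, failed): a {url: body} dict and a list of failed URLs.
    """
    pages, failed = {}, []
    for url in urls:
        try:
            pages[url] = fetcher(url)
        except OSError as exc:
            # urllib's URLError and HTTPError (404, 500, ...) and socket
            # timeouts all derive from OSError, so one clause covers them.
            print(f"[ingest] skipped {url}: {exc}", file=sys.stderr)
            failed.append(url)
    return pages, failed
```

Returning the failed URLs alongside the successful pages lets the caller decide whether to retry them or simply note them in the daily report.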
The pipeline automated the tracking of regulatory documentation changes. By filtering out 90% of HTML noise (ads/navs), the system reduced false positive alerts to near zero. Scheduling the pipeline with cron keeps monitoring running unattended and produces daily diff reports that highlight only semantic modifications.