PyMark Renderer

Dependency-free Markdown-to-HTML rendering engine utilizing strict Regex tokenization.

Role

Tech Stack

Python, Regex, Pytest, TDD

The Challenge

Parsing Markdown without external libraries runs into an ambiguity problem. Symbols like * or _ are context-dependent: they can denote a list item, italics, bold text, or literal characters depending on their position. A naive approach fails when formats nest or overlap (e.g., a link containing bold text). The core challenge was designing a processing pipeline that resolves these conflicts deterministically without building a full Abstract Syntax Tree (AST).
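To make the ambiguity concrete, here is a minimal sketch contrasting a naive italics rule with a boundary-aware one. Both patterns and both function names are illustrative assumptions, not code taken from PyMark.

```python
import re

# Hypothetical naive rule: any pair of single underscores becomes italics.
NAIVE_ITALIC = re.compile(r'_([^_]+)_')

# Hypothetical guarded rule: the markers must not touch a word character
# on their outer side, so underscores inside identifiers are skipped.
GUARDED_ITALIC = re.compile(r'(?<!\w)_([^_]+)_(?!\w)')


def naive_italic(line: str) -> str:
    return NAIVE_ITALIC.sub(r'<em>\1</em>', line)


def guarded_italic(line: str) -> str:
    return GUARDED_ITALIC.sub(r'<em>\1</em>', line)


sample = "Set my_variable_name to _zero_."
print(naive_italic(sample))    # Set my<em>variable</em>name to <em>zero</em>.
print(guarded_italic(sample))  # Set my_variable_name to <em>zero</em>.
```

The naive rule corrupts the identifier, while the guarded rule only formats the standalone word.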

Architecture & Deep Dive

System Architecture

Sequential Tokenization Pipeline

Key Implementation

python

import re

def apply_inline_formats(line: str) -> str:
    """
    Strict execution order ensures data integrity.
    Links must be processed first to prevent bold/italic markers
    inside URLs (e.g., underscores) from being corrupted.
    """
    # convert_link and convert_code are defined elsewhere (not shown here).
    line = convert_link(line)      # Priority 1: Protect URLs
    line = convert_code(line)      # Priority 2: Protect inline code
    line = convert_emphasis(line)  # Priority 3: Formatting
    return line

def convert_emphasis(line: str) -> str:
    # The lookbehind (?<!\w) and lookahead (?!\w) ensure we only match
    # underscore pairs that sit strictly on the borders of words.
    # Matches 'bold' in '__bold__' but ignores intra-word markers
    # such as 'my__variable__name'.
    line = re.sub(r'(?<!\w)__([^_]+)__(?!\w)', r'<strong>\1</strong>', line)
    return line
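Ordering alone protects a span only if later passes cannot see inside it. One possible way to guarantee that, shown here as an assumption rather than as PyMark's confirmed implementation, is to park finished fragments behind placeholders and restore them at the end. The patterns, the `park` helper, the `render_inline` name, and the NUL-delimited placeholder format are all hypothetical; the sketch also assumes the input contains no NUL characters and leaves HTML escaping out.

```python
import re

LINK = re.compile(r'\[([^\]]+)\]\(([^)\s]+)\)')
CODE = re.compile(r'`([^`]+)`')
BOLD = re.compile(r'(?<!\w)__([^_]+)__(?!\w)')
SLOT = re.compile(r'\x00(\d+)\x00')


def render_inline(line: str) -> str:
    stash: list[str] = []

    def park(fragment: str) -> str:
        # Store a finished fragment and return an opaque placeholder for it.
        stash.append(fragment)
        return f'\x00{len(stash) - 1}\x00'

    # Pass 1: links. The opening tag (which holds the URL) is parked;
    # the label stays visible so it can still receive formatting.
    line = LINK.sub(
        lambda m: park(f'<a href="{m.group(2)}">') + m.group(1) + '</a>', line
    )
    # Pass 2: inline code is parked whole, so nothing inside it is reformatted.
    line = CODE.sub(lambda m: park(f'<code>{m.group(1)}</code>'), line)
    # Pass 3: emphasis only sees text that was not parked.
    line = BOLD.sub(r'<strong>\1</strong>', line)
    # Restore parked fragments; loop because a fragment may hold an older slot.
    while SLOT.search(line):
        line = SLOT.sub(lambda m: stash[int(m.group(1))], line)
    return line


source = ("See [__init__ docs](https://example.com/__init__) "
          "and `__name__`, which is __important__.")
print(render_inline(source))
```

With this input, the `__init__` in the URL and the `__name__` in the code span come through untouched, while the link label and the trailing word are still bolded.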

Technical Trade-offs

I opted for a Regex-based Sequential Pipeline over a full AST parser. While an AST supports arbitrarily deep nesting, a Regex pipeline provides O(n) performance for typical documents and requires zero external dependencies, making it ideal for lightweight embedded scripting environments. The known limitation of order dependency was mitigated by enforcing a strict function call order in a single entry point (apply_inline_formats).
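The nesting side of this trade-off can be seen with the bold pattern from the Key Implementation (written here with `\w` and `\1`). Because the capture group `[^_]+` excludes underscores, a bold span that itself contains an underscore is left as raw text. This is a small sketch to illustrate the behavior of that one pattern, not a statement about the rest of the engine.

```python
import re

BOLD = re.compile(r'(?<!\w)__([^_]+)__(?!\w)')


def convert_emphasis(line: str) -> str:
    return BOLD.sub(r'<strong>\1</strong>', line)


print(convert_emphasis("__plain bold__"))
# <strong>plain bold</strong>
print(convert_emphasis("__bold with snake_case inside__"))
# __bold with snake_case inside__  (unchanged: the inner underscore blocks the match)
```

An AST-based parser could handle the second case, at the cost of the extra machinery the pipeline avoids.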

Reliability & Validation

Test Coverage

Built a `pytest` suite test-first (TDD) that covers 8 distinct edge-case categories.

Error Handling Strategy

Validated intra-word protection (test_convert_emphasis_invalid_underscore) to ensure identifiers like my_variable_name remain unformatted. Ensured graceful degradation for unclosed tags (test_convert_code_unclosed, test_convert_emphasis_unclosed), so malformed input renders as raw text rather than crashing the pipeline. Additionally, added strict HTML escaping tests (test_convert_paragraph_special_chars) that verify special characters such as < and & are escaped in the output, which guards against XSS via injected markup.
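The tests named above might look roughly like the following. The test names come from the text; the three converter bodies are stand-in assumptions so the file runs on its own, and the real PyMark functions may differ (for example, `convert_paragraph` is assumed here to escape with the standard library's `html.escape` and wrap the result in `<p>`).

```python
import html
import re


# --- Stand-in implementations (assumptions, not PyMark's actual code) ---
def convert_code(line: str) -> str:
    return re.sub(r'`([^`]+)`', r'<code>\1</code>', line)


def convert_emphasis(line: str) -> str:
    line = re.sub(r'(?<!\w)__([^_]+)__(?!\w)', r'<strong>\1</strong>', line)
    line = re.sub(r'(?<!\w)_([^_]+)_(?!\w)', r'<em>\1</em>', line)
    return line


def convert_paragraph(text: str) -> str:
    return f'<p>{html.escape(text, quote=False)}</p>'


# --- Tests, runnable with `pytest` ---
def test_convert_emphasis_invalid_underscore():
    # Underscores inside an identifier are not emphasis markers.
    assert convert_emphasis("my_variable_name") == "my_variable_name"


def test_convert_emphasis_unclosed():
    # A missing closing marker leaves the text as-is.
    assert convert_emphasis("__bold without end") == "__bold without end"


def test_convert_code_unclosed():
    # A single backtick does not open a code span.
    assert convert_code("`code without end") == "`code without end"


def test_convert_paragraph_special_chars():
    # Markup characters are escaped before they reach the output.
    assert convert_paragraph("a < b & c") == "<p>a &lt; b &amp; c</p>"
```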

Impact & Collaboration

Demonstrated mastery of Core Computer Science fundamentals (String Manipulation & Regex) without relying on "Magic Libraries." This lightweight engine was designed to be drop-in compatible for environments where installing pip packages like markdown or pandoc is restricted.
