PyMark Renderer

Dependency-free Markdown-to-HTML rendering engine utilizing strict Regex tokenization.

Role

Core Engineer

Tech Stack
Python, Regex, Pytest, TDD

The Challenge

Parsing Markdown without external libraries runs into an ambiguity problem: symbols like `*` or `_` are context-dependent and can denote a list item, italics, bold text, or literal characters depending on their position. A naive approach fails when formats nest or overlap (e.g., a link containing bold text). The core challenge was designing a processing pipeline that resolves these conflicts deterministically without building a heavy Abstract Syntax Tree (AST).
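
To make the ambiguity concrete, here is a minimal sketch of a naive italics rule applied to a list item. The pattern and the sample line are illustrative assumptions, not code taken from PyMark.

```python
import re

# Hypothetical naive rule: treat any pair of asterisks as italics.
NAIVE_ITALIC = re.compile(r'\*(.+?)\*')

def naive_italics(line: str) -> str:
    """Wrap whatever sits between two asterisks in <em> tags."""
    return NAIVE_ITALIC.sub(r'<em>\1</em>', line)

# The leading '*' is a list marker, but the naive rule pairs it with the
# opening marker of '*eggs*' and emphasizes the wrong span.
print(naive_italics("* buy milk and *eggs*"))
# -> <em> buy milk and </em>eggs*
```

The list marker is consumed as an emphasis delimiter, which is the kind of conflict a fixed processing order (block-level rules before inline rules) is meant to avoid.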

Architecture & Deep Dive

System Architecture

Sequential Tokenization Pipeline

Raw Markdown → Block Splitter → Block Parser (H1/List) → Inline Processor → HTML Sanitizer → Output HTML
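
The block-level half of such a pipeline could look roughly like the sketch below. The names `split_blocks`, `render_block`, and `render` are hypothetical, inline processing is left out, and escaping text at the point where it is wrapped in tags is an assumption about where sanitization sits, not a description of PyMark's actual code.

```python
import html
import re

def split_blocks(markdown: str) -> list[str]:
    """Block splitter: one or more blank lines separate blocks."""
    return [b.strip("\n") for b in re.split(r'\n\s*\n', markdown) if b.strip()]

def render_block(block: str) -> str:
    """Block parser: classify a block as H1, list, or paragraph."""
    lines = block.split("\n")
    if len(lines) == 1 and block.startswith("# "):
        return f"<h1>{html.escape(block[2:].strip())}</h1>"
    if all(re.match(r'[-*] ', ln) for ln in lines):
        items = "".join(f"<li>{html.escape(ln[2:].strip())}</li>" for ln in lines)
        return f"<ul>{items}</ul>"
    # Anything else is treated as a paragraph; special characters are escaped.
    return f"<p>{html.escape(' '.join(lines))}</p>"

def render(markdown: str) -> str:
    """Run every block through the block parser, in document order."""
    return "\n".join(render_block(b) for b in split_blocks(markdown))

print(render("# Title\n\n* one\n* two\n\nTom & Jerry <3"))
```

Because list markers are consumed at the block stage, a later inline stage never sees a leading `*` and cannot mistake it for emphasis.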

Key Implementation

python
import re

# convert_link and convert_code are defined elsewhere in the module.

def apply_inline_formats(line: str) -> str:
    """
    Strict execution order ensures data integrity.
    Links must be processed first to prevent bold/italic markers
    inside URLs (e.g., underscores) from being corrupted.
    """
    line = convert_link(line)      # Priority 1: protect URLs
    line = convert_code(line)      # Priority 2: protect inline code spans
    line = convert_emphasis(line)  # Priority 3: bold/italic formatting
    return line

def convert_emphasis(line: str) -> str:
    # The lookbehind (?<!\w) and lookahead (?!\w) only accept underscores
    # that sit on a word boundary.
    # Matches 'bold' in '__bold__' but leaves identifiers such as
    # 'my_variable_name' or 'my__variable__name' untouched.
    line = re.sub(r'(?<!\w)__([^_]+)__(?!\w)', r'<strong>\1</strong>', line)
    return line
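
A quick way to see what the lookarounds do is to run the bold rule on a few inputs. The sketch below reuses the same pattern under a hypothetical helper name, `bold`, so that it runs on its own.

```python
import re

# Same pattern as the bold rule: '__text__' not glued to word characters.
BOLD = re.compile(r'(?<!\w)__([^_]+)__(?!\w)')

def bold(line: str) -> str:
    return BOLD.sub(r'<strong>\1</strong>', line)

# Markers on word boundaries are converted.
assert bold("a __bold__ word") == "a <strong>bold</strong> word"

# Underscores inside identifiers are left alone.
assert bold("my_variable_name") == "my_variable_name"
assert bold("my__variable__name") == "my__variable__name"

# Consequence of [^_]+: bold text that itself contains an underscore
# is not matched by this pattern.
assert bold("__snake_case__") == "__snake_case__"
```

The last case is a direct consequence of the `[^_]+` character class, which keeps the match simple at the cost of not supporting underscores inside bold text.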

Technical Trade-offs

I opted for a regex-based sequential pipeline over a full AST parser. While an AST supports arbitrarily deep nesting, a regex pipeline runs in roughly O(n) time on typical documents and requires zero external dependencies, making it well suited to lightweight embedded scripting environments. The known limitation of order dependency was mitigated by enforcing a fixed call order in a single entry point (`apply_inline_formats`).
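
One common way to make a fixed order actually protective is to swap finished spans for placeholders so that later passes cannot touch them. The sketch below illustrates that idea only; the placeholder scheme, the link and code patterns, and the name `render_inline` are assumptions, and the source does not say PyMark works this way.

```python
import re

LINK = re.compile(r'\[([^\]]+)\]\(([^)\s]+)\)')   # [text](url)
CODE = re.compile(r'`([^`]+)`')                   # `inline code`
BOLD = re.compile(r'(?<!\w)__([^_]+)__(?!\w)')    # __bold__

def render_inline(line: str) -> str:
    stash: list[str] = []

    def protect(fragment: str) -> str:
        """Store finished HTML and return a NUL-delimited placeholder."""
        stash.append(fragment)
        return f"\x00{len(stash) - 1}\x00"

    # Pass 1 and 2: links and code are rendered and hidden from later passes.
    line = LINK.sub(lambda m: protect(f'<a href="{m.group(2)}">{m.group(1)}</a>'), line)
    line = CODE.sub(lambda m: protect(f'<code>{m.group(1)}</code>'), line)
    # Pass 3: emphasis only sees the text that is still unprotected.
    line = BOLD.sub(r'<strong>\1</strong>', line)
    # Final step: put the stashed fragments back.
    return re.sub(r'\x00(\d+)\x00', lambda m: stash[int(m.group(1))], line)

print(render_inline("see [docs](http://x.com/__init__) and __bold__"))
# -> see <a href="http://x.com/__init__">docs</a> and <strong>bold</strong>
```

The trade-off in this sketch is that link text is stashed whole, so bold inside a link label would not be rendered; handling that case needs either a recursive call on the label or a real tree.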

Reliability & Validation

Test Coverage

Built a `pytest` suite, written test-first following TDD principles, that covers 8 distinct edge-case categories.

Error Handling Strategy

Validated intra-word protection (`test_convert_emphasis_invalid_underscore`) to ensure variables like `my_variable_name` remain unformatted. Ensured graceful degradation for unclosed tags (`test_convert_code_unclosed`, `test_convert_emphasis_unclosed`), so that malformed input renders as raw text rather than crashing the pipeline. Additionally, added strict HTML escaping tests (`test_convert_paragraph_special_chars`) to confirm that special characters like `<` and `&` are escaped automatically, so markup in the input is not interpreted as HTML, which guards against XSS.
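
Tests of this kind might look like the sketch below. The test names come from the suite described above, but the bodies, the stand-in `convert_emphasis` (bold rule only), and the `escape_text` helper are assumptions written for illustration.

```python
import html
import re

BOLD = re.compile(r'(?<!\w)__([^_]+)__(?!\w)')

def convert_emphasis(line: str) -> str:
    # Stand-in for the project's function: only the bold rule is modeled.
    return BOLD.sub(r'<strong>\1</strong>', line)

def escape_text(text: str) -> str:
    # Hypothetical helper built on the standard library's html.escape.
    return html.escape(text)

def test_convert_emphasis_unclosed():
    # No closing marker: the regex finds nothing and the text is unchanged.
    assert convert_emphasis("an __unclosed marker") == "an __unclosed marker"

def test_convert_paragraph_special_chars():
    # '<' and '&' become entities, so injected tags render as text.
    assert escape_text("a < b & c") == "a &lt; b &amp; c"
    assert escape_text("<script>alert(1)</script>") == "&lt;script&gt;alert(1)&lt;/script&gt;"
```

With a regex-substitution design, graceful degradation largely comes for free: a pattern that does not match leaves the line as it was.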

Impact & Collaboration

Demonstrated mastery of Core Computer Science fundamentals (String Manipulation & Regex) without relying on "Magic Libraries." This lightweight engine was designed to be drop-in compatible for environments where installing `pip` packages like `markdown` or `pandoc` is restricted.