如何創造一門程式語言:以 Python 為例的深度剖析


引言:程式語言是如何運作的?

當你寫下 print("Hello, World!") 並按下執行,電腦是如何理解這段文字並執行對應操作的?這個問題的答案涉及編譯原理的核心:程式語言的設計與實現

本文將以 Python(特別是 CPython 實現)為例,帶你從零開始理解:

  • 如何將文字轉換為有意義的標記(詞法分析)
  • 如何構建程式的語法結構(語法解析)
  • 如何生成中間表示(字節碼編譯)
  • 如何執行程式(虛擬機)

最後,我們將實作一個簡單的解釋器,讓你親手體驗創造語言的樂趣。


第一章:程式語言處理流程概覽

1.1 從源碼到執行的旅程

graph LR
    subgraph "前端 Frontend"
        SRC[源碼<br/>Source Code] --> LEX[詞法分析<br/>Lexer]
        LEX --> TOK[標記流<br/>Tokens]
        TOK --> PARSE[語法分析<br/>Parser]
        PARSE --> AST[抽象語法樹<br/>AST]
    end

    subgraph "後端 Backend"
        AST --> SEM[語義分析<br/>Semantic]
        SEM --> IR[中間表示<br/>IR/Bytecode]
        IR --> OPT[優化<br/>Optimization]
        OPT --> EXEC[執行<br/>VM/Machine]
    end

    style SRC fill:#e1f5fe
    style EXEC fill:#c8e6c9

1.2 編譯型 vs 解釋型

graph TB
    subgraph "編譯型語言 (C, Go, Rust)"
        C_SRC[源碼] --> C_COMP[編譯器]
        C_COMP --> C_OBJ[目標碼/機器碼]
        C_OBJ --> C_EXEC[直接執行]
    end

    subgraph "解釋型語言 (Python, Ruby)"
        P_SRC[源碼] --> P_COMP[編譯器]
        P_COMP --> P_BC[字節碼]
        P_BC --> P_VM[虛擬機]
        P_VM --> P_EXEC[解釋執行]
    end

    subgraph "JIT 編譯 (Java, PyPy)"
        J_SRC[源碼] --> J_COMP[編譯器]
        J_COMP --> J_BC[字節碼]
        J_BC --> J_JIT[JIT 編譯器]
        J_JIT --> J_NATIVE[本機碼]
        J_NATIVE --> J_EXEC[執行]
    end

Python 採用的是混合模式:先編譯成字節碼,再由虛擬機解釋執行。


第二章:詞法分析 (Lexical Analysis)

2.1 什麼是詞法分析?

詞法分析是將源碼字元流轉換為**標記(Token)**序列的過程。就像閱讀句子時,我們會先識別出每個單字。

# 源碼
x = 10 + 20

# 詞法分析後的標記流
[
    Token(NAME, 'x'),
    Token(EQUAL, '='),
    Token(NUMBER, '10'),
    Token(PLUS, '+'),
    Token(NUMBER, '20'),
    Token(NEWLINE, '\n'),
]

2.2 CPython 的詞法分析器

CPython 的詞法分析器位於 Parser/tokenizer.c

graph TB
    subgraph "Tokenizer 工作流程"
        INPUT[源碼字元流] --> READ[讀取字元]
        READ --> CLASS{字元分類}

        CLASS --> |字母/下劃線| IDENT[識別符/關鍵字]
        CLASS --> |數字| NUM[數字字面量]
        CLASS --> |引號| STR[字串字面量]
        CLASS --> |運算符| OP[運算符]
        CLASS --> |空白| WS[跳過空白]
        CLASS --> |#| COMMENT[跳過註解]

        IDENT --> TOKEN[生成 Token]
        NUM --> TOKEN
        STR --> TOKEN
        OP --> TOKEN

        TOKEN --> READ
    end

CPython Tokenizer 源碼分析

// Parser/tokenizer.c
static int
tok_get(struct tok_state *tok, const char **p_start, const char **p_end)
{
    int c;
    int blankline, nonascii;

    *p_start = NULL;
    *p_end = NULL;

nextline:
    tok->start = NULL;
    blankline = 0;

    /* 跳過空格和縮進 */
    c = tok->buf != NULL ? *tok->cur : EOF;
    while (c == ' ' || c == '\t' || c == '\014') {
        c = tok_nextc(tok);
    }

    /* 設置 token 開始位置 */
    *p_start = tok->cur - 1;
    tok->start = tok->cur - 1;

    /* 根據首字元判斷 token 類型 */
    if (c == EOF) {
        return tok_done(tok, ENDMARKER, p_start, p_end);
    }

    /* 識別符和關鍵字 */
    if (is_potential_identifier_start(c)) {
        /* 讀取完整識別符 */
        int all_lower = (c >= 'a' && c <= 'z');
        while (is_potential_identifier_char(c)) {
            c = tok_nextc(tok);
        }
        tok_backup(tok, c);
        *p_end = tok->cur;

        /* 檢查是否為關鍵字 */
        return tok_done(tok, NAME, p_start, p_end);
    }

    /* 數字字面量 */
    if (c >= '0' && c <= '9') {
        return tok_decimal_tail(tok, c, p_start, p_end);
    }

    /* 字串字面量 */
    if (c == '\'' || c == '"') {
        return tok_string(tok, c, p_start, p_end);
    }

    /* 運算符和分隔符 */
    switch (c) {
        case '+':
            return tok_operator(tok, PLUS, p_start, p_end);
        case '-':
            return tok_operator(tok, MINUS, p_start, p_end);
        case '*':
            c2 = tok_nextc(tok);
            if (c2 == '*') {
                return tok_done(tok, DOUBLESTAR, p_start, p_end);
            }
            tok_backup(tok, c2);
            return tok_done(tok, STAR, p_start, p_end);
        // ... 更多運算符
    }
}

2.3 Python 中使用 tokenize 模組

import tokenize
import io

code = '''
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)
'''

tokens = tokenize.generate_tokens(io.StringIO(code).readline)
for tok in tokens:
    print(f'{tokenize.tok_name[tok.type]:10} {repr(tok.string):15} '
          f'line {tok.start[0]}, col {tok.start[1]}')

輸出:

ENCODING   'utf-8'         line 1, col 0
NL         '\n'            line 1, col 0
NAME       'def'           line 2, col 0
NAME       'factorial'     line 2, col 4
OP         '('             line 2, col 13
NAME       'n'             line 2, col 14
OP         ')'             line 2, col 15
OP         ':'             line 2, col 16
NEWLINE    '\n'            line 2, col 17
INDENT     '    '          line 3, col 0
NAME       'if'            line 3, col 4
NAME       'n'             line 3, col 7
OP         '<='            line 3, col 9
NUMBER     '1'             line 3, col 12
OP         ':'             line 3, col 13
...

2.4 實作一個簡單的 Tokenizer

import re
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterator

class TokenType(Enum):
    # 字面量
    NUMBER = auto()
    STRING = auto()

    # 識別符和關鍵字
    IDENTIFIER = auto()
    KEYWORD = auto()

    # 運算符
    PLUS = auto()
    MINUS = auto()
    STAR = auto()
    SLASH = auto()
    ASSIGN = auto()
    EQ = auto()
    NE = auto()
    LT = auto()
    GT = auto()
    LE = auto()
    GE = auto()

    # 分隔符
    LPAREN = auto()
    RPAREN = auto()
    LBRACE = auto()
    RBRACE = auto()
    COMMA = auto()
    COLON = auto()
    SEMICOLON = auto()

    # 特殊
    NEWLINE = auto()
    INDENT = auto()
    DEDENT = auto()
    EOF = auto()

@dataclass
class Token:
    type: TokenType
    value: str
    line: int
    column: int

KEYWORDS = {'if', 'else', 'while', 'for', 'def', 'return',
            'and', 'or', 'not', 'True', 'False', 'None'}

class Tokenizer:
    def __init__(self, source: str):
        self.source = source
        self.pos = 0
        self.line = 1
        self.column = 1
        self.indent_stack = [0]

    def current_char(self) -> str | None:
        if self.pos >= len(self.source):
            return None
        return self.source[self.pos]

    def advance(self) -> str:
        char = self.current_char()
        self.pos += 1
        if char == '\n':
            self.line += 1
            self.column = 1
        else:
            self.column += 1
        return char

    def peek(self, offset: int = 1) -> str | None:
        pos = self.pos + offset
        if pos >= len(self.source):
            return None
        return self.source[pos]

    def skip_whitespace(self):
        while self.current_char() in ' \t':
            self.advance()

    def skip_comment(self):
        if self.current_char() == '#':
            while self.current_char() and self.current_char() != '\n':
                self.advance()

    def read_number(self) -> Token:
        start_col = self.column
        result = ''

        # 整數部分
        while self.current_char() and self.current_char().isdigit():
            result += self.advance()

        # 小數部分
        if self.current_char() == '.' and self.peek() and self.peek().isdigit():
            result += self.advance()  # '.'
            while self.current_char() and self.current_char().isdigit():
                result += self.advance()

        return Token(TokenType.NUMBER, result, self.line, start_col)

    def read_string(self) -> Token:
        start_col = self.column
        quote = self.advance()  # 開始引號
        result = ''

        while self.current_char() and self.current_char() != quote:
            if self.current_char() == '\\':
                self.advance()
                escape_char = self.advance()
                escape_map = {'n': '\n', 't': '\t', '\\': '\\', "'": "'", '"': '"'}
                result += escape_map.get(escape_char, escape_char)
            else:
                result += self.advance()

        if self.current_char() == quote:
            self.advance()  # 結束引號

        return Token(TokenType.STRING, result, self.line, start_col)

    def read_identifier(self) -> Token:
        start_col = self.column
        result = ''

        while self.current_char() and (self.current_char().isalnum()
                                        or self.current_char() == '_'):
            result += self.advance()

        if result in KEYWORDS:
            return Token(TokenType.KEYWORD, result, self.line, start_col)

        return Token(TokenType.IDENTIFIER, result, self.line, start_col)

    def tokenize(self) -> Iterator[Token]:
        while self.current_char():
            # 處理行首縮進
            if self.column == 1:
                indent = 0
                while self.current_char() in ' \t':
                    if self.current_char() == ' ':
                        indent += 1
                    else:
                        indent += 4  # Tab = 4 spaces
                    self.advance()

                if self.current_char() not in '\n#':
                    # 生成 INDENT/DEDENT tokens
                    if indent > self.indent_stack[-1]:
                        self.indent_stack.append(indent)
                        yield Token(TokenType.INDENT, '', self.line, 1)
                    while indent < self.indent_stack[-1]:
                        self.indent_stack.pop()
                        yield Token(TokenType.DEDENT, '', self.line, 1)

            self.skip_whitespace()
            self.skip_comment()

            if not self.current_char():
                break

            char = self.current_char()

            # 換行
            if char == '\n':
                yield Token(TokenType.NEWLINE, '\n', self.line, self.column)
                self.advance()
                continue

            # 數字
            if char.isdigit():
                yield self.read_number()
                continue

            # 字串
            if char in '"\'':
                yield self.read_string()
                continue

            # 識別符/關鍵字
            if char.isalpha() or char == '_':
                yield self.read_identifier()
                continue

            # 雙字元運算符
            two_char = char + (self.peek() or '')
            two_char_ops = {
                '==': TokenType.EQ,
                '!=': TokenType.NE,
                '<=': TokenType.LE,
                '>=': TokenType.GE,
            }
            if two_char in two_char_ops:
                yield Token(two_char_ops[two_char], two_char,
                           self.line, self.column)
                self.advance()
                self.advance()
                continue

            # 單字元運算符
            single_char_ops = {
                '+': TokenType.PLUS,
                '-': TokenType.MINUS,
                '*': TokenType.STAR,
                '/': TokenType.SLASH,
                '=': TokenType.ASSIGN,
                '<': TokenType.LT,
                '>': TokenType.GT,
                '(': TokenType.LPAREN,
                ')': TokenType.RPAREN,
                '{': TokenType.LBRACE,
                '}': TokenType.RBRACE,
                ',': TokenType.COMMA,
                ':': TokenType.COLON,
                ';': TokenType.SEMICOLON,
            }

            if char in single_char_ops:
                yield Token(single_char_ops[char], char,
                           self.line, self.column)
                self.advance()
                continue

            raise SyntaxError(f"Unexpected character '{char}' "
                            f"at line {self.line}, column {self.column}")

        # 文件結束,生成剩餘的 DEDENT
        while len(self.indent_stack) > 1:
            self.indent_stack.pop()
            yield Token(TokenType.DEDENT, '', self.line, self.column)

        yield Token(TokenType.EOF, '', self.line, self.column)


# 測試
code = '''
def add(a, b):
    return a + b

x = add(10, 20)
'''

tokenizer = Tokenizer(code)
for token in tokenizer.tokenize():
    print(token)

第三章:語法分析 (Syntax Analysis)

3.1 語法分析的目標

語法分析器(Parser)讀取標記流,根據語法規則構建抽象語法樹(AST)

graph TB
    subgraph "Token Stream"
        T1[x] --> T2[=]
        T2 --> T3[10]
        T3 --> T4[+]
        T4 --> T5[20]
    end

    subgraph "AST"
        ASSIGN[Assign]
        NAME[Name: x]
        ADD[BinOp: +]
        N10[Num: 10]
        N20[Num: 20]

        ASSIGN --> NAME
        ASSIGN --> ADD
        ADD --> N10
        ADD --> N20
    end

    T2 -.-> ASSIGN

3.2 語法規則與 BNF

Python 的語法用**擴展巴科斯範式(EBNF)**描述:

# Grammar/python.gram (PEG 語法)

file: [statements] ENDMARKER

statements: statement+

statement: compound_stmt | simple_stmts

simple_stmts: simple_stmt (';' simple_stmt)* [';'] NEWLINE

simple_stmt:
    | assignment
    | return_stmt
    | raise_stmt
    | 'pass'
    | 'break'
    | 'continue'
    | expression

compound_stmt:
    | function_def
    | if_stmt
    | for_stmt
    | while_stmt
    | class_def

function_def:
    | 'def' NAME '(' [parameters] ')' ':' block

if_stmt:
    | 'if' expression ':' block elif_stmt
    | 'if' expression ':' block [else_block]

expression:
    | disjunction 'if' disjunction 'else' expression
    | disjunction

disjunction:
    | conjunction ('or' conjunction)+
    | conjunction

conjunction:
    | inversion ('and' inversion)+
    | inversion

comparison:
    | sum (comp_op sum)+
    | sum

sum:
    | sum '+' term
    | sum '-' term
    | term

term:
    | term '*' factor
    | term '/' factor
    | factor

factor:
    | '+' factor
    | '-' factor
    | power

power:
    | primary '**' factor
    | primary

primary:
    | primary '.' NAME
    | primary '(' [arguments] ')'
    | primary '[' expression ']'
    | atom

atom:
    | NAME
    | NUMBER
    | STRING
    | 'True'
    | 'False'
    | 'None'
    | '(' expression ')'
    | '[' [list_items] ']'
    | '{' [dict_items] '}'

3.3 解析技術比較

graph TB
    subgraph "Top-Down 解析"
        TD[遞歸下降<br/>Recursive Descent]
        LL[LL(k) 解析器]
        PEG[PEG 解析器]

        TD --> LL
        TD --> PEG
    end

    subgraph "Bottom-Up 解析"
        LR[LR 解析器]
        LALR[LALR(1)]
        GLR[GLR 解析器]

        LR --> LALR
        LR --> GLR
    end

    NOTE[Python 3.9+ 使用 PEG]
    NOTE -.-> PEG

CPython 從 3.9 版本開始使用 PEG 解析器,取代了原來的 LL(1) 解析器。

3.4 CPython PEG 解析器

// Parser/pegen.c
// PEG 解析器生成 AST

mod_ty
_PyPegen_run_parser_from_string(const char *str, int start_rule,
                                 PyObject *filename, int mode,
                                 PyCompilerFlags *flags,
                                 PyArena *arena)
{
    struct tok_state *tok = _PyTokenizer_FromString(str, mode);
    if (tok == NULL) {
        return NULL;
    }

    Parser *p = _PyPegen_Parser_New(tok, start_rule, flags, arena);
    if (p == NULL) {
        _PyTokenizer_Free(tok);
        return NULL;
    }

    mod_ty result = NULL;

    // 根據起始規則選擇解析函數
    switch (start_rule) {
        case Py_file_input:
            result = file_rule(p);
            break;
        case Py_single_input:
            result = interactive_rule(p);
            break;
        case Py_eval_input:
            result = eval_rule(p);
            break;
    }

    _PyPegen_Parser_Free(p);
    _PyTokenizer_Free(tok);

    return result;
}

// 解析 if 語句的規則
// if_stmt: 'if' expression ':' block elif_stmt | 'if' expression ':' block [else_block]
static stmt_ty
if_stmt_rule(Parser *p)
{
    Token *keyword;
    expr_ty condition;
    asdl_stmt_seq *body;
    asdl_stmt_seq *orelse = NULL;

    // 匹配 'if' 關鍵字
    if ((keyword = _PyPegen_expect_token(p, NAME)) == NULL) {
        return NULL;
    }
    if (strcmp(PyBytes_AS_STRING(keyword->bytes), "if") != 0) {
        return NULL;
    }

    // 匹配條件表達式
    if ((condition = expression_rule(p)) == NULL) {
        return NULL;
    }

    // 匹配 ':'
    if (_PyPegen_expect_token(p, COLON) == NULL) {
        return NULL;
    }

    // 匹配程式區塊
    if ((body = block_rule(p)) == NULL) {
        return NULL;
    }

    // 嘗試匹配 elif 或 else
    orelse = elif_stmt_rule(p);
    if (orelse == NULL) {
        orelse = else_block_rule(p);
    }

    // 構建 AST 節點
    return _PyAST_If(condition, body, orelse, EXTRA);
}

3.5 Python AST 結構

import ast

code = '''
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)
'''

tree = ast.parse(code)
print(ast.dump(tree, indent=2))

輸出:

Module(
  body=[
    FunctionDef(
      name='factorial',
      args=arguments(
        posonlyargs=[],
        args=[arg(arg='n')],
        kwonlyargs=[],
        kw_defaults=[],
        defaults=[]),
      body=[
        If(
          test=Compare(
            left=Name(id='n', ctx=Load()),
            ops=[LtE()],
            comparators=[Constant(value=1)]),
          body=[
            Return(value=Constant(value=1))],
          orelse=[]),
        Return(
          value=BinOp(
            left=Name(id='n', ctx=Load()),
            op=Mult(),
            right=Call(
              func=Name(id='factorial', ctx=Load()),
              args=[
                BinOp(
                  left=Name(id='n', ctx=Load()),
                  op=Sub(),
                  right=Constant(value=1))],
              keywords=[])))],
      decorator_list=[])],
  type_ignores=[])

3.6 實作遞歸下降解析器

from dataclasses import dataclass
from typing import Any, Optional
from abc import ABC, abstractmethod

# ========== AST 節點定義 ==========

class ASTNode(ABC):
    pass

@dataclass
class NumberNode(ASTNode):
    value: float

@dataclass
class StringNode(ASTNode):
    value: str

@dataclass
class IdentifierNode(ASTNode):
    name: str

@dataclass
class BinaryOpNode(ASTNode):
    op: str
    left: ASTNode
    right: ASTNode

@dataclass
class UnaryOpNode(ASTNode):
    op: str
    operand: ASTNode

@dataclass
class AssignNode(ASTNode):
    target: str
    value: ASTNode

@dataclass
class IfNode(ASTNode):
    condition: ASTNode
    then_body: list[ASTNode]
    else_body: list[ASTNode]

@dataclass
class WhileNode(ASTNode):
    condition: ASTNode
    body: list[ASTNode]

@dataclass
class FunctionDefNode(ASTNode):
    name: str
    params: list[str]
    body: list[ASTNode]

@dataclass
class FunctionCallNode(ASTNode):
    name: str
    args: list[ASTNode]

@dataclass
class ReturnNode(ASTNode):
    value: Optional[ASTNode]

@dataclass
class PrintNode(ASTNode):
    value: ASTNode

# ========== 解析器實現 ==========

class Parser:
    def __init__(self, tokens: list[Token]):
        self.tokens = tokens
        self.pos = 0

    def current_token(self) -> Optional[Token]:
        if self.pos >= len(self.tokens):
            return None
        return self.tokens[self.pos]

    def peek(self, offset: int = 0) -> Optional[Token]:
        pos = self.pos + offset
        if pos >= len(self.tokens):
            return None
        return self.tokens[pos]

    def advance(self) -> Token:
        token = self.current_token()
        self.pos += 1
        return token

    def expect(self, token_type: TokenType, value: str = None) -> Token:
        token = self.current_token()
        if token is None:
            raise SyntaxError(f"Unexpected end of input, expected {token_type}")
        if token.type != token_type:
            raise SyntaxError(f"Expected {token_type}, got {token.type} at line {token.line}")
        if value and token.value != value:
            raise SyntaxError(f"Expected '{value}', got '{token.value}' at line {token.line}")
        return self.advance()

    def match(self, token_type: TokenType, value: str = None) -> bool:
        token = self.current_token()
        if token is None:
            return False
        if token.type != token_type:
            return False
        if value and token.value != value:
            return False
        return True

    # ========== 語法規則 ==========

    def parse(self) -> list[ASTNode]:
        """程式 = 語句列表"""
        statements = []
        while not self.match(TokenType.EOF):
            if self.match(TokenType.NEWLINE):
                self.advance()
                continue
            statements.append(self.parse_statement())
        return statements

    def parse_statement(self) -> ASTNode:
        """語句 = 複合語句 | 簡單語句"""
        token = self.current_token()

        if token.type == TokenType.KEYWORD:
            if token.value == 'def':
                return self.parse_function_def()
            elif token.value == 'if':
                return self.parse_if()
            elif token.value == 'while':
                return self.parse_while()
            elif token.value == 'return':
                return self.parse_return()

        return self.parse_simple_statement()

    def parse_function_def(self) -> FunctionDefNode:
        """函數定義 = 'def' 名稱 '(' 參數列表 ')' ':' 區塊"""
        self.expect(TokenType.KEYWORD, 'def')
        name = self.expect(TokenType.IDENTIFIER).value
        self.expect(TokenType.LPAREN)

        params = []
        while not self.match(TokenType.RPAREN):
            params.append(self.expect(TokenType.IDENTIFIER).value)
            if self.match(TokenType.COMMA):
                self.advance()

        self.expect(TokenType.RPAREN)
        self.expect(TokenType.COLON)
        self.expect(TokenType.NEWLINE)
        self.expect(TokenType.INDENT)

        body = self.parse_block()

        return FunctionDefNode(name, params, body)

    def parse_block(self) -> list[ASTNode]:
        """區塊 = 縮進語句列表"""
        statements = []

        while not self.match(TokenType.DEDENT) and not self.match(TokenType.EOF):
            if self.match(TokenType.NEWLINE):
                self.advance()
                continue
            statements.append(self.parse_statement())

        if self.match(TokenType.DEDENT):
            self.advance()

        return statements

    def parse_if(self) -> IfNode:
        """if 語句 = 'if' 表達式 ':' 區塊 ['else' ':' 區塊]"""
        self.expect(TokenType.KEYWORD, 'if')
        condition = self.parse_expression()
        self.expect(TokenType.COLON)
        self.expect(TokenType.NEWLINE)
        self.expect(TokenType.INDENT)

        then_body = self.parse_block()
        else_body = []

        if self.match(TokenType.KEYWORD) and self.current_token().value == 'else':
            self.advance()
            self.expect(TokenType.COLON)
            self.expect(TokenType.NEWLINE)
            self.expect(TokenType.INDENT)
            else_body = self.parse_block()

        return IfNode(condition, then_body, else_body)

    def parse_while(self) -> WhileNode:
        """while 語句 = 'while' 表達式 ':' 區塊"""
        self.expect(TokenType.KEYWORD, 'while')
        condition = self.parse_expression()
        self.expect(TokenType.COLON)
        self.expect(TokenType.NEWLINE)
        self.expect(TokenType.INDENT)

        body = self.parse_block()

        return WhileNode(condition, body)

    def parse_return(self) -> ReturnNode:
        """return 語句 = 'return' [表達式]"""
        self.expect(TokenType.KEYWORD, 'return')

        value = None
        if not self.match(TokenType.NEWLINE):
            value = self.parse_expression()

        if self.match(TokenType.NEWLINE):
            self.advance()

        return ReturnNode(value)

    def parse_simple_statement(self) -> ASTNode:
        """簡單語句 = 賦值 | 表達式"""
        # 檢查是否為賦值
        if (self.match(TokenType.IDENTIFIER) and
            self.peek(1) and self.peek(1).type == TokenType.ASSIGN):
            name = self.advance().value
            self.advance()  # 跳過 '='
            value = self.parse_expression()
            if self.match(TokenType.NEWLINE):
                self.advance()
            return AssignNode(name, value)

        # 否則是表達式語句
        expr = self.parse_expression()
        if self.match(TokenType.NEWLINE):
            self.advance()
        return expr

    # ========== 表達式解析(運算符優先級) ==========

    def parse_expression(self) -> ASTNode:
        """表達式 = 比較表達式"""
        return self.parse_comparison()

    def parse_comparison(self) -> ASTNode:
        """比較表達式 = 加法 (比較運算符 加法)*"""
        left = self.parse_additive()

        while self.current_token() and self.current_token().type in (
            TokenType.LT, TokenType.GT, TokenType.LE, TokenType.GE,
            TokenType.EQ, TokenType.NE
        ):
            op = self.advance().value
            right = self.parse_additive()
            left = BinaryOpNode(op, left, right)

        return left

    def parse_additive(self) -> ASTNode:
        """加法表達式 = 乘法 (('+' | '-') 乘法)*"""
        left = self.parse_multiplicative()

        while self.current_token() and self.current_token().type in (
            TokenType.PLUS, TokenType.MINUS
        ):
            op = self.advance().value
            right = self.parse_multiplicative()
            left = BinaryOpNode(op, left, right)

        return left

    def parse_multiplicative(self) -> ASTNode:
        """乘法表達式 = 一元表達式 (('*' | '/') 一元表達式)*"""
        left = self.parse_unary()

        while self.current_token() and self.current_token().type in (
            TokenType.STAR, TokenType.SLASH
        ):
            op = self.advance().value
            right = self.parse_unary()
            left = BinaryOpNode(op, left, right)

        return left

    def parse_unary(self) -> ASTNode:
        """一元表達式 = ('+' | '-') 一元表達式 | 主表達式"""
        if self.current_token() and self.current_token().type in (
            TokenType.PLUS, TokenType.MINUS
        ):
            op = self.advance().value
            operand = self.parse_unary()
            return UnaryOpNode(op, operand)

        return self.parse_primary()

    def parse_primary(self) -> ASTNode:
        """主表達式 = 數字 | 字串 | 識別符 | 函數調用 | '(' 表達式 ')'"""
        token = self.current_token()

        if token.type == TokenType.NUMBER:
            self.advance()
            return NumberNode(float(token.value))

        if token.type == TokenType.STRING:
            self.advance()
            return StringNode(token.value)

        if token.type == TokenType.IDENTIFIER:
            name = self.advance().value

            # 檢查是否為函數調用
            if self.match(TokenType.LPAREN):
                self.advance()
                args = []
                while not self.match(TokenType.RPAREN):
                    args.append(self.parse_expression())
                    if self.match(TokenType.COMMA):
                        self.advance()
                self.expect(TokenType.RPAREN)
                return FunctionCallNode(name, args)

            return IdentifierNode(name)

        if token.type == TokenType.KEYWORD:
            if token.value == 'True':
                self.advance()
                return NumberNode(1)
            elif token.value == 'False':
                self.advance()
                return NumberNode(0)
            elif token.value == 'None':
                self.advance()
                return NumberNode(0)

        if token.type == TokenType.LPAREN:
            self.advance()
            expr = self.parse_expression()
            self.expect(TokenType.RPAREN)
            return expr

        raise SyntaxError(f"Unexpected token {token} at line {token.line}")

第四章:字節碼編譯

4.1 什麼是字節碼?

Python 字節碼是一種中間表示,介於源碼和機器碼之間。它是一系列虛擬機指令,由 Python 虛擬機解釋執行。

import dis

def add(a, b):
    return a + b

dis.dis(add)

輸出:

  2           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 BINARY_ADD
              6 RETURN_VALUE

4.2 CPython 字節碼編譯流程

graph LR
    AST[AST] --> SYM[符號表分析<br/>symtable]
    SYM --> CFG[控制流圖<br/>CFG]
    CFG --> BC[字節碼生成<br/>bytecode]
    BC --> OPT[窺孔優化<br/>peephole]
    OPT --> CODE[Code Object]

4.3 Python 字節碼指令

import dis
import opcode

# 查看所有操作碼
print("Python 字節碼操作碼:")
for name, code in sorted(opcode.opmap.items(), key=lambda x: x[1]):
    print(f"  {code:3d}: {name}")

常見指令分類:

# 1. 載入指令
LOAD_CONST      # 載入常量
LOAD_FAST       # 載入局部變量
LOAD_GLOBAL     # 載入全局變量
LOAD_NAME       # 載入名稱
LOAD_ATTR       # 載入屬性

# 2. 存儲指令
STORE_FAST      # 存儲局部變量
STORE_GLOBAL    # 存儲全局變量
STORE_NAME      # 存儲名稱
STORE_ATTR      # 存儲屬性

# 3. 運算指令
BINARY_ADD      # 加法
BINARY_SUBTRACT # 減法
BINARY_MULTIPLY # 乘法
BINARY_TRUE_DIVIDE  # 除法
COMPARE_OP      # 比較

# 4. 控制流指令
POP_JUMP_IF_FALSE   # 條件跳轉
POP_JUMP_IF_TRUE
JUMP_FORWARD        # 向前跳轉
JUMP_ABSOLUTE       # 絕對跳轉

# 5. 函數指令
CALL_FUNCTION   # 調用函數
RETURN_VALUE    # 返回
MAKE_FUNCTION   # 創建函數

# 6. 堆棧操作
POP_TOP         # 彈出棧頂
DUP_TOP         # 複製棧頂
ROT_TWO         # 交換棧頂兩元素

4.4 Code Object 結構

def example(x, y):
    z = x + y
    return z * 2

code = example.__code__

print(f"函數名: {code.co_name}")
print(f"參數數量: {code.co_argcount}")
print(f"局部變量: {code.co_varnames}")
print(f"常量: {code.co_consts}")
print(f"名稱: {code.co_names}")
print(f"字節碼: {code.co_code.hex()}")
print(f"行號表: {code.co_lnotab.hex()}")

輸出:

函數名: example
參數數量: 2
局部變量: ('x', 'y', 'z')
常量: (None, 2)
名稱: ()
字節碼: 7c007c01170090005a02640153
行號表: 0001040106

4.5 CPython 編譯器源碼

// Python/compile.c
// AST 到字節碼的編譯

static int
compiler_visit_stmt(struct compiler *c, stmt_ty s)
{
    switch (s->kind) {
        case FunctionDef_kind:
            return compiler_function(c, s);

        case If_kind:
            return compiler_if(c, s);

        case While_kind:
            return compiler_while(c, s);

        case For_kind:
            return compiler_for(c, s);

        case Return_kind:
            return compiler_return(c, s);

        case Assign_kind:
            return compiler_assign(c, s);

        case Expr_kind:
            return compiler_expr_stmt(c, s);

        // ... 更多語句類型
    }
}

// 編譯 if 語句
static int
compiler_if(struct compiler *c, stmt_ty s)
{
    basicblock *end, *next;

    end = compiler_new_block(c);
    if (end == NULL) {
        return 0;
    }

    // 編譯條件表達式
    VISIT(c, expr, s->v.If.test);

    // 如果有 else 分支
    if (asdl_seq_LEN(s->v.If.orelse)) {
        next = compiler_new_block(c);
        if (next == NULL) {
            return 0;
        }

        // 條件為假時跳轉到 else
        ADDOP_JUMP(c, POP_JUMP_IF_FALSE, next);

        // 編譯 then 分支
        VISIT_SEQ(c, stmt, s->v.If.body);

        // 跳過 else 分支
        ADDOP_JUMP(c, JUMP_FORWARD, end);

        // else 分支
        compiler_use_next_block(c, next);
        VISIT_SEQ(c, stmt, s->v.If.orelse);
    }
    else {
        // 沒有 else,條件為假時跳到結尾
        ADDOP_JUMP(c, POP_JUMP_IF_FALSE, end);

        // 編譯 then 分支
        VISIT_SEQ(c, stmt, s->v.If.body);
    }

    compiler_use_next_block(c, end);
    return 1;
}

// 編譯二元運算
static int
compiler_boolop(struct compiler *c, expr_ty e)
{
    basicblock *end;
    Py_ssize_t i, n;
    asdl_expr_seq *s;

    end = compiler_new_block(c);
    s = e->v.BoolOp.values;
    n = asdl_seq_LEN(s);

    for (i = 0; i < n - 1; i++) {
        VISIT(c, expr, asdl_seq_GET(s, i));

        // and: 短路求值,為假時跳轉
        // or: 短路求值,為真時跳轉
        if (e->v.BoolOp.op == And) {
            ADDOP_JUMP(c, JUMP_IF_FALSE_OR_POP, end);
        }
        else {
            ADDOP_JUMP(c, JUMP_IF_TRUE_OR_POP, end);
        }
    }

    // 最後一個表達式
    VISIT(c, expr, asdl_seq_GET(s, n - 1));

    compiler_use_next_block(c, end);
    return 1;
}

第五章:Python 虛擬機 (PVM)

5.1 虛擬機架構

graph TB
    subgraph "Python Virtual Machine"
        EVAL[求值迴路<br/>Evaluation Loop]

        subgraph "Frame Stack"
            F1[Frame 1<br/>main]
            F2[Frame 2<br/>function]
            F3[Frame 3<br/>nested]
        end

        subgraph "Each Frame"
            STACK[Value Stack]
            LOCALS[Local Variables]
            GLOBALS[Global Variables]
            CODE[Code Object]
        end

        EVAL --> F1
        F1 --> F2
        F2 --> F3

        F3 --> STACK
        F3 --> LOCALS
        F3 --> GLOBALS
        F3 --> CODE
    end

5.2 求值迴路 (Evaluation Loop)

CPython 的核心是 Python/ceval.c 中的求值迴路:

// Python/ceval.c
// 主求值迴路 - 這是 Python 程式執行的核心

PyObject *
_PyEval_EvalFrameDefault(PyThreadState *tstate, PyFrameObject *f, int throwflag)
{
    // 關鍵寄存器
    PyObject **stack_pointer;     // 堆棧指針
    _Py_CODEUNIT *next_instr;     // 指令指針
    int opcode;                   // 當前操作碼
    int oparg;                    // 操作數

    // 快取常用對象
    PyObject *names;              // 名稱元組
    PyObject *consts;             // 常量元組
    PyObject **fastlocals;        // 快速局部變量

    // 初始化
    names = code->co_names;
    consts = code->co_consts;
    fastlocals = f->f_localsplus;
    stack_pointer = f->f_valuestack;
    next_instr = f->f_lasti + 1;

    // 主迴路
    for (;;) {
        // 獲取下一條指令
        opcode = _Py_OPCODE(*next_instr);
        oparg = _Py_OPARG(*next_instr);
        next_instr++;

        // 巨大的 switch 語句處理每個操作碼
        switch (opcode) {

            case LOAD_FAST: {
                // 載入局部變量到堆棧
                PyObject *value = GETLOCAL(oparg);
                if (value == NULL) {
                    goto unbound_local_error;
                }
                Py_INCREF(value);
                PUSH(value);
                DISPATCH();
            }

            case LOAD_CONST: {
                // 載入常量到堆棧
                PyObject *value = GETITEM(consts, oparg);
                Py_INCREF(value);
                PUSH(value);
                DISPATCH();
            }

            case STORE_FAST: {
                // 存儲堆棧頂到局部變量
                PyObject *value = POP();
                SETLOCAL(oparg, value);
                DISPATCH();
            }

            case BINARY_ADD: {
                // 加法運算
                PyObject *right = POP();
                PyObject *left = TOP();
                PyObject *result = PyNumber_Add(left, right);
                Py_DECREF(left);
                Py_DECREF(right);
                SET_TOP(result);
                if (result == NULL) {
                    goto error;
                }
                DISPATCH();
            }

            case COMPARE_OP: {
                // 比較運算
                PyObject *right = POP();
                PyObject *left = TOP();
                PyObject *result = cmp_outcome(tstate, oparg, left, right);
                Py_DECREF(left);
                Py_DECREF(right);
                SET_TOP(result);
                if (result == NULL) {
                    goto error;
                }
                DISPATCH();
            }

            case POP_JUMP_IF_FALSE: {
                // 條件跳轉
                PyObject *cond = POP();
                int is_true = PyObject_IsTrue(cond);
                Py_DECREF(cond);
                if (is_true < 0) {
                    goto error;
                }
                if (!is_true) {
                    JUMPTO(oparg);
                }
                DISPATCH();
            }

            case CALL_FUNCTION: {
                // 函數調用
                PyObject **sp = stack_pointer;
                PyObject *result = call_function(tstate, &sp, oparg);
                stack_pointer = sp;
                PUSH(result);
                if (result == NULL) {
                    goto error;
                }
                DISPATCH();
            }

            case RETURN_VALUE: {
                // 函數返回
                retval = POP();
                goto exit_returning;
            }

            // ... 更多操作碼 (約 100+ 個)
        }
    }

exit_returning:
    return retval;

error:
    // 錯誤處理
    return NULL;
}

5.3 堆棧機 vs 寄存器機

graph TB
    subgraph "堆棧機 (CPython)"
        S1[LOAD a]
        S2[LOAD b]
        S3[ADD]
        S4[STORE c]

        S1 --> S2 --> S3 --> S4

        STACK1["Stack: []"]
        STACK2["Stack: [a]"]
        STACK3["Stack: [a, b]"]
        STACK4["Stack: [a+b]"]
        STACK5["Stack: []"]
    end

    subgraph "寄存器機 (Lua)"
        R1["ADD R3, R1, R2"]
        R2_["MOVE R4, R3"]
    end

Python 使用堆棧機架構:

  • 優點:指令簡單,編譯器實現容易
  • 缺點:需要更多指令,更多記憶體操作

5.4 實作簡單的虛擬機

from enum import IntEnum
from dataclasses import dataclass
from typing import Any

class OpCode(IntEnum):
    # 載入/存儲
    LOAD_CONST = 1
    LOAD_VAR = 2
    STORE_VAR = 3

    # 運算
    ADD = 10
    SUB = 11
    MUL = 12
    DIV = 13
    NEG = 14

    # 比較
    LT = 20
    GT = 21
    LE = 22
    GE = 23
    EQ = 24
    NE = 25

    # 控制流
    JUMP = 30
    JUMP_IF_FALSE = 31
    JUMP_IF_TRUE = 32

    # 函數
    CALL = 40
    RETURN = 41

    # 其他
    PRINT = 50
    POP = 51
    DUP = 52
    HALT = 99

@dataclass
class Instruction:
    opcode: OpCode
    arg: Any = None

class VirtualMachine:
    def __init__(self):
        self.stack: list = []
        self.variables: dict = {}
        self.functions: dict = {}
        self.call_stack: list = []  # (return_address, saved_variables)
        self.ip: int = 0  # 指令指針

    def push(self, value):
        self.stack.append(value)

    def pop(self):
        if not self.stack:
            raise RuntimeError("Stack underflow")
        return self.stack.pop()

    def top(self):
        if not self.stack:
            raise RuntimeError("Stack is empty")
        return self.stack[-1]

    def execute(self, instructions: list[Instruction]):
        self.ip = 0

        while self.ip < len(instructions):
            instr = instructions[self.ip]
            self.ip += 1

            match instr.opcode:
                case OpCode.LOAD_CONST:
                    self.push(instr.arg)

                case OpCode.LOAD_VAR:
                    name = instr.arg
                    if name in self.variables:
                        self.push(self.variables[name])
                    else:
                        raise RuntimeError(f"Undefined variable: {name}")

                case OpCode.STORE_VAR:
                    name = instr.arg
                    self.variables[name] = self.pop()

                case OpCode.ADD:
                    b = self.pop()
                    a = self.pop()
                    self.push(a + b)

                case OpCode.SUB:
                    b = self.pop()
                    a = self.pop()
                    self.push(a - b)

                case OpCode.MUL:
                    b = self.pop()
                    a = self.pop()
                    self.push(a * b)

                case OpCode.DIV:
                    b = self.pop()
                    a = self.pop()
                    self.push(a / b)

                case OpCode.NEG:
                    a = self.pop()
                    self.push(-a)

                case OpCode.LT:
                    b = self.pop()
                    a = self.pop()
                    self.push(1 if a < b else 0)

                case OpCode.GT:
                    b = self.pop()
                    a = self.pop()
                    self.push(1 if a > b else 0)

                case OpCode.LE:
                    b = self.pop()
                    a = self.pop()
                    self.push(1 if a <= b else 0)

                case OpCode.GE:
                    b = self.pop()
                    a = self.pop()
                    self.push(1 if a >= b else 0)

                case OpCode.EQ:
                    b = self.pop()
                    a = self.pop()
                    self.push(1 if a == b else 0)

                case OpCode.NE:
                    b = self.pop()
                    a = self.pop()
                    self.push(1 if a != b else 0)

                case OpCode.JUMP:
                    self.ip = instr.arg

                case OpCode.JUMP_IF_FALSE:
                    condition = self.pop()
                    if not condition:
                        self.ip = instr.arg

                case OpCode.JUMP_IF_TRUE:
                    condition = self.pop()
                    if condition:
                        self.ip = instr.arg

                case OpCode.CALL:
                    func_name, arg_count = instr.arg
                    if func_name not in self.functions:
                        raise RuntimeError(f"Undefined function: {func_name}")

                    func_addr, param_names = self.functions[func_name]

                    # 保存調用現場
                    saved_vars = self.variables.copy()
                    self.call_stack.append((self.ip, saved_vars))

                    # 綁定參數
                    args = [self.pop() for _ in range(arg_count)][::-1]
                    self.variables = {}
                    for name, value in zip(param_names, args):
                        self.variables[name] = value

                    # 跳轉到函數
                    self.ip = func_addr

                case OpCode.RETURN:
                    return_value = self.pop() if self.stack else None

                    if self.call_stack:
                        self.ip, self.variables = self.call_stack.pop()
                        if return_value is not None:
                            self.push(return_value)
                    else:
                        return return_value

                case OpCode.PRINT:
                    value = self.pop()
                    print(value)

                case OpCode.POP:
                    self.pop()

                case OpCode.DUP:
                    self.push(self.top())

                case OpCode.HALT:
                    break

        return None

# ========== 編譯器:AST 到字節碼 ==========

class Compiler:
    def __init__(self):
        self.instructions: list[Instruction] = []
        self.functions: dict = {}  # name -> (address, params)

    def emit(self, opcode: OpCode, arg=None) -> int:
        addr = len(self.instructions)
        self.instructions.append(Instruction(opcode, arg))
        return addr

    def compile(self, nodes: list[ASTNode]) -> list[Instruction]:
        # 第一遍:收集函數定義
        for node in nodes:
            if isinstance(node, FunctionDefNode):
                self.compile_function_def(node)

        # 第二遍:編譯主程式
        main_start = len(self.instructions)
        for node in nodes:
            if not isinstance(node, FunctionDefNode):
                self.compile_node(node)

        self.emit(OpCode.HALT)

        return self.instructions, self.functions, main_start

    def compile_function_def(self, node: FunctionDefNode):
        func_addr = len(self.instructions)
        self.functions[node.name] = (func_addr, node.params)

        for stmt in node.body:
            self.compile_node(stmt)

        # 確保函數有返回值
        self.emit(OpCode.LOAD_CONST, None)
        self.emit(OpCode.RETURN)

    def compile_node(self, node: ASTNode):
        match node:
            case NumberNode(value):
                self.emit(OpCode.LOAD_CONST, value)

            case StringNode(value):
                self.emit(OpCode.LOAD_CONST, value)

            case IdentifierNode(name):
                self.emit(OpCode.LOAD_VAR, name)

            case BinaryOpNode(op, left, right):
                self.compile_node(left)
                self.compile_node(right)
                op_map = {
                    '+': OpCode.ADD,
                    '-': OpCode.SUB,
                    '*': OpCode.MUL,
                    '/': OpCode.DIV,
                    '<': OpCode.LT,
                    '>': OpCode.GT,
                    '<=': OpCode.LE,
                    '>=': OpCode.GE,
                    '==': OpCode.EQ,
                    '!=': OpCode.NE,
                }
                self.emit(op_map[op])

            case UnaryOpNode(op, operand):
                self.compile_node(operand)
                if op == '-':
                    self.emit(OpCode.NEG)

            case AssignNode(target, value):
                self.compile_node(value)
                self.emit(OpCode.STORE_VAR, target)

            case IfNode(condition, then_body, else_body):
                self.compile_node(condition)

                if else_body:
                    # if-else
                    else_jump = self.emit(OpCode.JUMP_IF_FALSE, None)

                    for stmt in then_body:
                        self.compile_node(stmt)

                    end_jump = self.emit(OpCode.JUMP, None)

                    # else 分支
                    else_addr = len(self.instructions)
                    self.instructions[else_jump].arg = else_addr

                    for stmt in else_body:
                        self.compile_node(stmt)

                    end_addr = len(self.instructions)
                    self.instructions[end_jump].arg = end_addr
                else:
                    # if only
                    end_jump = self.emit(OpCode.JUMP_IF_FALSE, None)

                    for stmt in then_body:
                        self.compile_node(stmt)

                    end_addr = len(self.instructions)
                    self.instructions[end_jump].arg = end_addr

            case WhileNode(condition, body):
                loop_start = len(self.instructions)

                self.compile_node(condition)
                end_jump = self.emit(OpCode.JUMP_IF_FALSE, None)

                for stmt in body:
                    self.compile_node(stmt)

                self.emit(OpCode.JUMP, loop_start)

                end_addr = len(self.instructions)
                self.instructions[end_jump].arg = end_addr

            case FunctionCallNode(name, args):
                for arg in args:
                    self.compile_node(arg)
                self.emit(OpCode.CALL, (name, len(args)))

            case ReturnNode(value):
                if value:
                    self.compile_node(value)
                else:
                    self.emit(OpCode.LOAD_CONST, None)
                self.emit(OpCode.RETURN)

            case PrintNode(value):
                self.compile_node(value)
                self.emit(OpCode.PRINT)

# ========== 完整解釋器 ==========

class MiniPythonInterpreter:
    def __init__(self):
        self.vm = VirtualMachine()

    def run(self, source: str):
        # 1. 詞法分析
        tokenizer = Tokenizer(source)
        tokens = list(tokenizer.tokenize())

        # 2. 語法分析
        parser = Parser(tokens)
        ast = parser.parse()

        # 3. 編譯
        compiler = Compiler()
        instructions, functions, main_start = compiler.compile(ast)

        # 4. 執行
        self.vm.functions = functions
        self.vm.ip = main_start

        # Debug: 打印字節碼
        print("=== Bytecode ===")
        for i, instr in enumerate(instructions):
            print(f"{i:4d}: {instr.opcode.name:20} {instr.arg}")
        print("================\n")

        return self.vm.execute(instructions)


# 測試
if __name__ == "__main__":
    code = '''
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)

result = factorial(5)
print(result)
'''

    interpreter = MiniPythonInterpreter()
    interpreter.run(code)

第六章:Python 對象模型

6.1 一切皆對象

在 Python 中,所有東西都是對象,包括數字、字串、函數、類別,甚至是類型本身。

graph TB
    subgraph "Python 對象層次"
        OBJ[object]
        TYPE[type]

        OBJ --> TYPE
        TYPE --> TYPE

        INT[int]
        STR[str]
        LIST[list]
        FUNC[function]

        OBJ --> INT
        OBJ --> STR
        OBJ --> LIST
        OBJ --> FUNC

        TYPE --> INT
        TYPE --> STR
        TYPE --> LIST
        TYPE --> FUNC
    end

6.2 PyObject 結構

// Include/object.h
// 所有 Python 對象的基礎結構

typedef struct _object {
    Py_ssize_t ob_refcnt;    // 引用計數
    PyTypeObject *ob_type;    // 類型指針
} PyObject;

// 變長對象
typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size;       // 元素數量
} PyVarObject;

// 整數對象 (小整數池化)
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];        // 數字陣列
};

// 字串對象
typedef struct {
    PyObject_VAR_HEAD
    Py_hash_t hash;           // 快取的 hash 值
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    wchar_t *wstr;
} PyASCIIObject;

// 列表對象
typedef struct {
    PyObject_VAR_HEAD
    PyObject **ob_item;       // 元素陣列
    Py_ssize_t allocated;     // 分配的空間
} PyListObject;

// 字典對象
typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;       // 使用的條目數
    uint64_t ma_version_tag;  // 版本標籤
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;

6.3 類型對象

// Include/cpython/object.h
typedef struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name;           // 類型名稱

    Py_ssize_t tp_basicsize;       // 基本大小
    Py_ssize_t tp_itemsize;        // 每個元素大小

    // 特殊方法槽
    destructor tp_dealloc;         // __del__
    reprfunc tp_repr;              // __repr__
    PyNumberMethods *tp_as_number; // 數值方法
    PySequenceMethods *tp_as_sequence; // 序列方法
    PyMappingMethods *tp_as_mapping;   // 映射方法

    hashfunc tp_hash;              // __hash__
    ternaryfunc tp_call;           // __call__
    reprfunc tp_str;               // __str__

    getattrofunc tp_getattro;      // __getattr__
    setattrofunc tp_setattro;      // __setattr__

    richcmpfunc tp_richcompare;    // __eq__, __lt__, 等

    Py_ssize_t tp_weaklistoffset;  // 弱引用
    getiterfunc tp_iter;           // __iter__
    iternextfunc tp_iternext;      // __next__

    struct PyMethodDef *tp_methods;    // 方法列表
    struct PyMemberDef *tp_members;    // 成員列表
    struct PyGetSetDef *tp_getset;     // 屬性

    PyTypeObject *tp_base;         // 基類
    PyObject *tp_dict;             // 類型字典
    descrgetfunc tp_descr_get;     // 描述符 get
    descrsetfunc tp_descr_set;     // 描述符 set

    initproc tp_init;              // __init__
    allocfunc tp_alloc;            // 分配
    newfunc tp_new;                // __new__
    freefunc tp_free;              // 釋放

} PyTypeObject;

6.4 方法解析順序 (MRO)

class A:
    def method(self):
        return "A"

class B(A):
    def method(self):
        return "B"

class C(A):
    def method(self):
        return "C"

class D(B, C):
    pass

# MRO 使用 C3 線性化算法
print(D.__mro__)
# (<class 'D'>, <class 'B'>, <class 'C'>, <class 'A'>, <class 'object'>)

d = D()
print(d.method())  # 輸出: B
graph BT
    D --> B
    D --> C
    B --> A
    C --> A
    A --> object

    subgraph "MRO 順序"
        M1[1. D] --> M2[2. B]
        M2 --> M3[3. C]
        M3 --> M4[4. A]
        M4 --> M5[5. object]
    end

第七章:記憶體管理

7.1 引用計數

// Include/object.h
#define Py_INCREF(op) (((PyObject *)(op))->ob_refcnt++)

#define Py_DECREF(op)                           \
    do {                                        \
        PyObject *_py_decref_tmp = (PyObject *)(op); \
        if (--(_py_decref_tmp->ob_refcnt) == 0) { \
            _Py_Dealloc(_py_decref_tmp);        \
        }                                       \
    } while (0)
import sys

a = [1, 2, 3]
print(sys.getrefcount(a))  # 2 (a 和 getrefcount 的參數)

b = a
print(sys.getrefcount(a))  # 3

del b
print(sys.getrefcount(a))  # 2

7.2 垃圾回收 (循環引用)

graph LR
    subgraph "循環引用問題"
        A[Object A<br/>refcount=1]
        B[Object B<br/>refcount=1]

        A --> |引用| B
        B --> |引用| A
    end

    subgraph "標記清除"
        G1[GC 根] --> A2[A<br/>可達]
        A2 --> B2[B<br/>可達]
        C2[C<br/>不可達] --> D2[D<br/>不可達]
        D2 --> C2
    end
// Modules/gcmodule.c
// 分代垃圾回收

#define NUM_GENERATIONS 3

struct gc_generation {
    PyGC_Head head;
    int threshold;  // 觸發閾值
    int count;      // 當前對象數
};

static struct gc_generation generations[NUM_GENERATIONS] = {
    /* 第 0 代 - 新對象 */
    {{(uintptr_t)_GEN_HEAD(0), (uintptr_t)_GEN_HEAD(0)}, 700, 0},
    /* 第 1 代 */
    {{(uintptr_t)_GEN_HEAD(1), (uintptr_t)_GEN_HEAD(1)}, 10, 0},
    /* 第 2 代 - 長壽對象 */
    {{(uintptr_t)_GEN_HEAD(2), (uintptr_t)_GEN_HEAD(2)}, 10, 0},
};

static Py_ssize_t
gc_collect_main(PyThreadState *tstate, int generation)
{
    // 1. 將所有可追蹤對象的 gc_refs 設為 ob_refcnt
    update_refs(young);

    // 2. 遍歷對象,減少被引用對象的 gc_refs
    subtract_refs(young);

    // 3. gc_refs > 0 的對象是可達的,移動到不可回收列表
    gc_list_init(&unreachable);
    move_unreachable(young, &unreachable);

    // 4. 釋放不可達對象
    delete_garbage(&unreachable);

    return n;
}

7.3 記憶體分配器

graph TB
    subgraph "Python 記憶體架構"
        L3[Python Object Allocator<br/>對象分配器]
        L2[Python Memory Allocator<br/>pymalloc]
        L1[System Allocator<br/>malloc/free]
        L0[OS Memory<br/>mmap/brk]

        L3 --> L2
        L2 --> L1
        L1 --> L0
    end

    subgraph "pymalloc 結構"
        ARENA[Arena 256KB]
        POOL1[Pool 4KB]
        POOL2[Pool 4KB]
        BLOCK1[Block 8B]
        BLOCK2[Block 16B]

        ARENA --> POOL1
        ARENA --> POOL2
        POOL1 --> BLOCK1
        POOL2 --> BLOCK2
    end

第八章:進階主題

8.1 描述符協議

class Descriptor:
    def __get__(self, obj, objtype=None):
        print(f"Getting from {obj}")
        return obj._value

    def __set__(self, obj, value):
        print(f"Setting {value} on {obj}")
        obj._value = value

class MyClass:
    attr = Descriptor()

    def __init__(self, value):
        self.attr = value

obj = MyClass(42)
print(obj.attr)

8.2 元類 (Metaclass)

class SingletonMeta(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]

class Database(metaclass=SingletonMeta):
    def __init__(self):
        print("Creating database connection")

# 只會創建一次
db1 = Database()
db2 = Database()
print(db1 is db2)  # True

8.3 生成器與協程

import dis

def generator_example():
    yield 1
    yield 2
    yield 3

# 查看生成器字節碼
dis.dis(generator_example)

# 生成器使用 YIELD_VALUE 指令
#   0 LOAD_CONST               1 (1)
#   2 YIELD_VALUE
#   4 POP_TOP
#   6 LOAD_CONST               2 (2)
#   8 YIELD_VALUE
#  10 POP_TOP
#  12 LOAD_CONST               3 (3)
#  14 YIELD_VALUE
#  16 POP_TOP
#  18 LOAD_CONST               0 (None)
#  20 RETURN_VALUE

總結

創造一門程式語言涉及多個階段:

graph LR
    A[設計語法] --> B[實現詞法分析器]
    B --> C[實現語法分析器]
    C --> D[構建 AST]
    D --> E[語義分析]
    E --> F[生成中間碼/字節碼]
    F --> G[實現虛擬機/編譯器]
    G --> H[記憶體管理]
    H --> I[標準庫]

Python 的成功在於:

  1. 簡潔的語法:縮進而非大括號
  2. 動態類型:靈活但有成本
  3. 豐富的對象模型:一切皆對象
  4. 自動記憶體管理:引用計數 + 分代 GC
  5. 優秀的生態系統:標準庫和第三方庫

理解這些底層機制,將幫助你:

  • 寫出更高效的 Python 代碼
  • 更好地調試和優化程式
  • 有能力為 Python 貢獻代碼
  • 設計和實現自己的程式語言

參考資源

  1. CPython 源碼
  2. Python Developer’s Guide
  3. PEP 339 - Design of the CPython Compiler
  4. Crafting Interpreters
  5. Green Tree Snakes - the missing Python AST docs