如何創造一門程式語言:以 Python 為例的深度剖析
引言:程式語言是如何運作的?
當你寫下 print("Hello, World!") 並按下執行,電腦是如何理解這段文字並執行對應操作的?這個問題的答案涉及編譯原理的核心:程式語言的設計與實現。
本文將以 Python(特別是 CPython 實現)為例,帶你從零開始理解:
- 如何將文字轉換為有意義的標記(詞法分析)
- 如何構建程式的語法結構(語法解析)
- 如何生成中間表示(字節碼編譯)
- 如何執行程式(虛擬機)
最後,我們將實作一個簡單的解釋器,讓你親手體驗創造語言的樂趣。
第一章:程式語言處理流程概覽
1.1 從源碼到執行的旅程
graph LR
subgraph "前端 Frontend"
SRC[源碼<br/>Source Code] --> LEX[詞法分析<br/>Lexer]
LEX --> TOK[標記流<br/>Tokens]
TOK --> PARSE[語法分析<br/>Parser]
PARSE --> AST[抽象語法樹<br/>AST]
end
subgraph "後端 Backend"
AST --> SEM[語義分析<br/>Semantic]
SEM --> IR[中間表示<br/>IR/Bytecode]
IR --> OPT[優化<br/>Optimization]
OPT --> EXEC[執行<br/>VM/Machine]
end
style SRC fill:#e1f5fe
style EXEC fill:#c8e6c9
1.2 編譯型 vs 解釋型
graph TB
subgraph "編譯型語言 (C, Go, Rust)"
C_SRC[源碼] --> C_COMP[編譯器]
C_COMP --> C_OBJ[目標碼/機器碼]
C_OBJ --> C_EXEC[直接執行]
end
subgraph "解釋型語言 (Python, Ruby)"
P_SRC[源碼] --> P_COMP[編譯器]
P_COMP --> P_BC[字節碼]
P_BC --> P_VM[虛擬機]
P_VM --> P_EXEC[解釋執行]
end
subgraph "JIT 編譯 (Java, PyPy)"
J_SRC[源碼] --> J_COMP[編譯器]
J_COMP --> J_BC[字節碼]
J_BC --> J_JIT[JIT 編譯器]
J_JIT --> J_NATIVE[本機碼]
J_NATIVE --> J_EXEC[執行]
end
Python 採用的是混合模式:先編譯成字節碼,再由虛擬機解釋執行。
第二章:詞法分析 (Lexical Analysis)
2.1 什麼是詞法分析?
詞法分析是將源碼字元流轉換為**標記(Token)**序列的過程。就像閱讀句子時,我們會先識別出每個單字。
# 源碼
x = 10 + 20
# 詞法分析後的標記流
[
Token(NAME, 'x'),
Token(EQUAL, '='),
Token(NUMBER, '10'),
Token(PLUS, '+'),
Token(NUMBER, '20'),
Token(NEWLINE, '\n'),
]
2.2 CPython 的詞法分析器
CPython 的詞法分析器位於 Parser/tokenizer.c。
graph TB
subgraph "Tokenizer 工作流程"
INPUT[源碼字元流] --> READ[讀取字元]
READ --> CLASS{字元分類}
CLASS --> |字母/下劃線| IDENT[識別符/關鍵字]
CLASS --> |數字| NUM[數字字面量]
CLASS --> |引號| STR[字串字面量]
CLASS --> |運算符| OP[運算符]
CLASS --> |空白| WS[跳過空白]
CLASS --> |#| COMMENT[跳過註解]
IDENT --> TOKEN[生成 Token]
NUM --> TOKEN
STR --> TOKEN
OP --> TOKEN
TOKEN --> READ
end
CPython Tokenizer 源碼分析
// Parser/tokenizer.c
static int
tok_get(struct tok_state *tok, const char **p_start, const char **p_end)
{
int c;
int blankline, nonascii;
*p_start = NULL;
*p_end = NULL;
nextline:
tok->start = NULL;
blankline = 0;
/* 跳過空格和縮進 */
c = tok->buf != NULL ? *tok->cur : EOF;
while (c == ' ' || c == '\t' || c == '\014') {
c = tok_nextc(tok);
}
/* 設置 token 開始位置 */
*p_start = tok->cur - 1;
tok->start = tok->cur - 1;
/* 根據首字元判斷 token 類型 */
if (c == EOF) {
return tok_done(tok, ENDMARKER, p_start, p_end);
}
/* 識別符和關鍵字 */
if (is_potential_identifier_start(c)) {
/* 讀取完整識別符 */
int all_lower = (c >= 'a' && c <= 'z');
while (is_potential_identifier_char(c)) {
c = tok_nextc(tok);
}
tok_backup(tok, c);
*p_end = tok->cur;
/* 檢查是否為關鍵字 */
return tok_done(tok, NAME, p_start, p_end);
}
/* 數字字面量 */
if (c >= '0' && c <= '9') {
return tok_decimal_tail(tok, c, p_start, p_end);
}
/* 字串字面量 */
if (c == '\'' || c == '"') {
return tok_string(tok, c, p_start, p_end);
}
/* 運算符和分隔符 */
switch (c) {
case '+':
return tok_operator(tok, PLUS, p_start, p_end);
case '-':
return tok_operator(tok, MINUS, p_start, p_end);
case '*':
c2 = tok_nextc(tok);
if (c2 == '*') {
return tok_done(tok, DOUBLESTAR, p_start, p_end);
}
tok_backup(tok, c2);
return tok_done(tok, STAR, p_start, p_end);
// ... 更多運算符
}
}
2.3 Python 中使用 tokenize 模組
import tokenize
import io
code = '''
def factorial(n):
if n <= 1:
return 1
return n * factorial(n - 1)
'''
tokens = tokenize.generate_tokens(io.StringIO(code).readline)
for tok in tokens:
print(f'{tokenize.tok_name[tok.type]:10} {repr(tok.string):15} '
f'line {tok.start[0]}, col {tok.start[1]}')
輸出:
ENCODING 'utf-8' line 1, col 0
NL '\n' line 1, col 0
NAME 'def' line 2, col 0
NAME 'factorial' line 2, col 4
OP '(' line 2, col 13
NAME 'n' line 2, col 14
OP ')' line 2, col 15
OP ':' line 2, col 16
NEWLINE '\n' line 2, col 17
INDENT ' ' line 3, col 0
NAME 'if' line 3, col 4
NAME 'n' line 3, col 7
OP '<=' line 3, col 9
NUMBER '1' line 3, col 12
OP ':' line 3, col 13
...
2.4 實作一個簡單的 Tokenizer
import re
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterator
class TokenType(Enum):
# 字面量
NUMBER = auto()
STRING = auto()
# 識別符和關鍵字
IDENTIFIER = auto()
KEYWORD = auto()
# 運算符
PLUS = auto()
MINUS = auto()
STAR = auto()
SLASH = auto()
ASSIGN = auto()
EQ = auto()
NE = auto()
LT = auto()
GT = auto()
LE = auto()
GE = auto()
# 分隔符
LPAREN = auto()
RPAREN = auto()
LBRACE = auto()
RBRACE = auto()
COMMA = auto()
COLON = auto()
SEMICOLON = auto()
# 特殊
NEWLINE = auto()
INDENT = auto()
DEDENT = auto()
EOF = auto()
@dataclass
class Token:
type: TokenType
value: str
line: int
column: int
KEYWORDS = {'if', 'else', 'while', 'for', 'def', 'return',
'and', 'or', 'not', 'True', 'False', 'None'}
class Tokenizer:
def __init__(self, source: str):
self.source = source
self.pos = 0
self.line = 1
self.column = 1
self.indent_stack = [0]
def current_char(self) -> str | None:
if self.pos >= len(self.source):
return None
return self.source[self.pos]
def advance(self) -> str:
char = self.current_char()
self.pos += 1
if char == '\n':
self.line += 1
self.column = 1
else:
self.column += 1
return char
def peek(self, offset: int = 1) -> str | None:
pos = self.pos + offset
if pos >= len(self.source):
return None
return self.source[pos]
def skip_whitespace(self):
while self.current_char() in ' \t':
self.advance()
def skip_comment(self):
if self.current_char() == '#':
while self.current_char() and self.current_char() != '\n':
self.advance()
def read_number(self) -> Token:
start_col = self.column
result = ''
# 整數部分
while self.current_char() and self.current_char().isdigit():
result += self.advance()
# 小數部分
if self.current_char() == '.' and self.peek() and self.peek().isdigit():
result += self.advance() # '.'
while self.current_char() and self.current_char().isdigit():
result += self.advance()
return Token(TokenType.NUMBER, result, self.line, start_col)
def read_string(self) -> Token:
start_col = self.column
quote = self.advance() # 開始引號
result = ''
while self.current_char() and self.current_char() != quote:
if self.current_char() == '\\':
self.advance()
escape_char = self.advance()
escape_map = {'n': '\n', 't': '\t', '\\': '\\', "'": "'", '"': '"'}
result += escape_map.get(escape_char, escape_char)
else:
result += self.advance()
if self.current_char() == quote:
self.advance() # 結束引號
return Token(TokenType.STRING, result, self.line, start_col)
def read_identifier(self) -> Token:
start_col = self.column
result = ''
while self.current_char() and (self.current_char().isalnum()
or self.current_char() == '_'):
result += self.advance()
if result in KEYWORDS:
return Token(TokenType.KEYWORD, result, self.line, start_col)
return Token(TokenType.IDENTIFIER, result, self.line, start_col)
def tokenize(self) -> Iterator[Token]:
while self.current_char():
# 處理行首縮進
if self.column == 1:
indent = 0
while self.current_char() in ' \t':
if self.current_char() == ' ':
indent += 1
else:
indent += 4 # Tab = 4 spaces
self.advance()
if self.current_char() not in '\n#':
# 生成 INDENT/DEDENT tokens
if indent > self.indent_stack[-1]:
self.indent_stack.append(indent)
yield Token(TokenType.INDENT, '', self.line, 1)
while indent < self.indent_stack[-1]:
self.indent_stack.pop()
yield Token(TokenType.DEDENT, '', self.line, 1)
self.skip_whitespace()
self.skip_comment()
if not self.current_char():
break
char = self.current_char()
# 換行
if char == '\n':
yield Token(TokenType.NEWLINE, '\n', self.line, self.column)
self.advance()
continue
# 數字
if char.isdigit():
yield self.read_number()
continue
# 字串
if char in '"\'':
yield self.read_string()
continue
# 識別符/關鍵字
if char.isalpha() or char == '_':
yield self.read_identifier()
continue
# 雙字元運算符
two_char = char + (self.peek() or '')
two_char_ops = {
'==': TokenType.EQ,
'!=': TokenType.NE,
'<=': TokenType.LE,
'>=': TokenType.GE,
}
if two_char in two_char_ops:
yield Token(two_char_ops[two_char], two_char,
self.line, self.column)
self.advance()
self.advance()
continue
# 單字元運算符
single_char_ops = {
'+': TokenType.PLUS,
'-': TokenType.MINUS,
'*': TokenType.STAR,
'/': TokenType.SLASH,
'=': TokenType.ASSIGN,
'<': TokenType.LT,
'>': TokenType.GT,
'(': TokenType.LPAREN,
')': TokenType.RPAREN,
'{': TokenType.LBRACE,
'}': TokenType.RBRACE,
',': TokenType.COMMA,
':': TokenType.COLON,
';': TokenType.SEMICOLON,
}
if char in single_char_ops:
yield Token(single_char_ops[char], char,
self.line, self.column)
self.advance()
continue
raise SyntaxError(f"Unexpected character '{char}' "
f"at line {self.line}, column {self.column}")
# 文件結束,生成剩餘的 DEDENT
while len(self.indent_stack) > 1:
self.indent_stack.pop()
yield Token(TokenType.DEDENT, '', self.line, self.column)
yield Token(TokenType.EOF, '', self.line, self.column)
# 測試
code = '''
def add(a, b):
return a + b
x = add(10, 20)
'''
tokenizer = Tokenizer(code)
for token in tokenizer.tokenize():
print(token)
第三章:語法分析 (Syntax Analysis)
3.1 語法分析的目標
語法分析器(Parser)讀取標記流,根據語法規則構建抽象語法樹(AST)。
graph TB
subgraph "Token Stream"
T1[x] --> T2[=]
T2 --> T3[10]
T3 --> T4[+]
T4 --> T5[20]
end
subgraph "AST"
ASSIGN[Assign]
NAME[Name: x]
ADD[BinOp: +]
N10[Num: 10]
N20[Num: 20]
ASSIGN --> NAME
ASSIGN --> ADD
ADD --> N10
ADD --> N20
end
T2 -.-> ASSIGN
3.2 語法規則與 BNF
Python 的語法用**擴展巴科斯範式(EBNF)**描述:
# Grammar/python.gram (PEG 語法)
file: [statements] ENDMARKER
statements: statement+
statement: compound_stmt | simple_stmts
simple_stmts: simple_stmt (';' simple_stmt)* [';'] NEWLINE
simple_stmt:
| assignment
| return_stmt
| raise_stmt
| 'pass'
| 'break'
| 'continue'
| expression
compound_stmt:
| function_def
| if_stmt
| for_stmt
| while_stmt
| class_def
function_def:
| 'def' NAME '(' [parameters] ')' ':' block
if_stmt:
| 'if' expression ':' block elif_stmt
| 'if' expression ':' block [else_block]
expression:
| disjunction 'if' disjunction 'else' expression
| disjunction
disjunction:
| conjunction ('or' conjunction)+
| conjunction
conjunction:
| inversion ('and' inversion)+
| inversion
comparison:
| sum (comp_op sum)+
| sum
sum:
| sum '+' term
| sum '-' term
| term
term:
| term '*' factor
| term '/' factor
| factor
factor:
| '+' factor
| '-' factor
| power
power:
| primary '**' factor
| primary
primary:
| primary '.' NAME
| primary '(' [arguments] ')'
| primary '[' expression ']'
| atom
atom:
| NAME
| NUMBER
| STRING
| 'True'
| 'False'
| 'None'
| '(' expression ')'
| '[' [list_items] ']'
| '{' [dict_items] '}'
3.3 解析技術比較
graph TB
subgraph "Top-Down 解析"
TD[遞歸下降<br/>Recursive Descent]
LL[LL(k) 解析器]
PEG[PEG 解析器]
TD --> LL
TD --> PEG
end
subgraph "Bottom-Up 解析"
LR[LR 解析器]
LALR[LALR(1)]
GLR[GLR 解析器]
LR --> LALR
LR --> GLR
end
NOTE[Python 3.9+ 使用 PEG]
NOTE -.-> PEG
CPython 從 3.9 版本開始使用 PEG 解析器,取代了原來的 LL(1) 解析器。
3.4 CPython PEG 解析器
// Parser/pegen.c
// PEG 解析器生成 AST
mod_ty
_PyPegen_run_parser_from_string(const char *str, int start_rule,
PyObject *filename, int mode,
PyCompilerFlags *flags,
PyArena *arena)
{
struct tok_state *tok = _PyTokenizer_FromString(str, mode);
if (tok == NULL) {
return NULL;
}
Parser *p = _PyPegen_Parser_New(tok, start_rule, flags, arena);
if (p == NULL) {
_PyTokenizer_Free(tok);
return NULL;
}
mod_ty result = NULL;
// 根據起始規則選擇解析函數
switch (start_rule) {
case Py_file_input:
result = file_rule(p);
break;
case Py_single_input:
result = interactive_rule(p);
break;
case Py_eval_input:
result = eval_rule(p);
break;
}
_PyPegen_Parser_Free(p);
_PyTokenizer_Free(tok);
return result;
}
// 解析 if 語句的規則
// if_stmt: 'if' expression ':' block elif_stmt | 'if' expression ':' block [else_block]
static stmt_ty
if_stmt_rule(Parser *p)
{
Token *keyword;
expr_ty condition;
asdl_stmt_seq *body;
asdl_stmt_seq *orelse = NULL;
// 匹配 'if' 關鍵字
if ((keyword = _PyPegen_expect_token(p, NAME)) == NULL) {
return NULL;
}
if (strcmp(PyBytes_AS_STRING(keyword->bytes), "if") != 0) {
return NULL;
}
// 匹配條件表達式
if ((condition = expression_rule(p)) == NULL) {
return NULL;
}
// 匹配 ':'
if (_PyPegen_expect_token(p, COLON) == NULL) {
return NULL;
}
// 匹配程式區塊
if ((body = block_rule(p)) == NULL) {
return NULL;
}
// 嘗試匹配 elif 或 else
orelse = elif_stmt_rule(p);
if (orelse == NULL) {
orelse = else_block_rule(p);
}
// 構建 AST 節點
return _PyAST_If(condition, body, orelse, EXTRA);
}
3.5 Python AST 結構
import ast
code = '''
def factorial(n):
if n <= 1:
return 1
return n * factorial(n - 1)
'''
tree = ast.parse(code)
print(ast.dump(tree, indent=2))
輸出:
Module(
body=[
FunctionDef(
name='factorial',
args=arguments(
posonlyargs=[],
args=[arg(arg='n')],
kwonlyargs=[],
kw_defaults=[],
defaults=[]),
body=[
If(
test=Compare(
left=Name(id='n', ctx=Load()),
ops=[LtE()],
comparators=[Constant(value=1)]),
body=[
Return(value=Constant(value=1))],
orelse=[]),
Return(
value=BinOp(
left=Name(id='n', ctx=Load()),
op=Mult(),
right=Call(
func=Name(id='factorial', ctx=Load()),
args=[
BinOp(
left=Name(id='n', ctx=Load()),
op=Sub(),
right=Constant(value=1))],
keywords=[])))],
decorator_list=[])],
type_ignores=[])
3.6 實作遞歸下降解析器
from dataclasses import dataclass
from typing import Any, Optional
from abc import ABC, abstractmethod
# ========== AST 節點定義 ==========
class ASTNode(ABC):
pass
@dataclass
class NumberNode(ASTNode):
value: float
@dataclass
class StringNode(ASTNode):
value: str
@dataclass
class IdentifierNode(ASTNode):
name: str
@dataclass
class BinaryOpNode(ASTNode):
op: str
left: ASTNode
right: ASTNode
@dataclass
class UnaryOpNode(ASTNode):
op: str
operand: ASTNode
@dataclass
class AssignNode(ASTNode):
target: str
value: ASTNode
@dataclass
class IfNode(ASTNode):
condition: ASTNode
then_body: list[ASTNode]
else_body: list[ASTNode]
@dataclass
class WhileNode(ASTNode):
condition: ASTNode
body: list[ASTNode]
@dataclass
class FunctionDefNode(ASTNode):
name: str
params: list[str]
body: list[ASTNode]
@dataclass
class FunctionCallNode(ASTNode):
name: str
args: list[ASTNode]
@dataclass
class ReturnNode(ASTNode):
value: Optional[ASTNode]
@dataclass
class PrintNode(ASTNode):
value: ASTNode
# ========== 解析器實現 ==========
class Parser:
def __init__(self, tokens: list[Token]):
self.tokens = tokens
self.pos = 0
def current_token(self) -> Optional[Token]:
if self.pos >= len(self.tokens):
return None
return self.tokens[self.pos]
def peek(self, offset: int = 0) -> Optional[Token]:
pos = self.pos + offset
if pos >= len(self.tokens):
return None
return self.tokens[pos]
def advance(self) -> Token:
token = self.current_token()
self.pos += 1
return token
def expect(self, token_type: TokenType, value: str = None) -> Token:
token = self.current_token()
if token is None:
raise SyntaxError(f"Unexpected end of input, expected {token_type}")
if token.type != token_type:
raise SyntaxError(f"Expected {token_type}, got {token.type} at line {token.line}")
if value and token.value != value:
raise SyntaxError(f"Expected '{value}', got '{token.value}' at line {token.line}")
return self.advance()
def match(self, token_type: TokenType, value: str = None) -> bool:
token = self.current_token()
if token is None:
return False
if token.type != token_type:
return False
if value and token.value != value:
return False
return True
# ========== 語法規則 ==========
def parse(self) -> list[ASTNode]:
"""程式 = 語句列表"""
statements = []
while not self.match(TokenType.EOF):
if self.match(TokenType.NEWLINE):
self.advance()
continue
statements.append(self.parse_statement())
return statements
def parse_statement(self) -> ASTNode:
"""語句 = 複合語句 | 簡單語句"""
token = self.current_token()
if token.type == TokenType.KEYWORD:
if token.value == 'def':
return self.parse_function_def()
elif token.value == 'if':
return self.parse_if()
elif token.value == 'while':
return self.parse_while()
elif token.value == 'return':
return self.parse_return()
return self.parse_simple_statement()
def parse_function_def(self) -> FunctionDefNode:
"""函數定義 = 'def' 名稱 '(' 參數列表 ')' ':' 區塊"""
self.expect(TokenType.KEYWORD, 'def')
name = self.expect(TokenType.IDENTIFIER).value
self.expect(TokenType.LPAREN)
params = []
while not self.match(TokenType.RPAREN):
params.append(self.expect(TokenType.IDENTIFIER).value)
if self.match(TokenType.COMMA):
self.advance()
self.expect(TokenType.RPAREN)
self.expect(TokenType.COLON)
self.expect(TokenType.NEWLINE)
self.expect(TokenType.INDENT)
body = self.parse_block()
return FunctionDefNode(name, params, body)
def parse_block(self) -> list[ASTNode]:
"""區塊 = 縮進語句列表"""
statements = []
while not self.match(TokenType.DEDENT) and not self.match(TokenType.EOF):
if self.match(TokenType.NEWLINE):
self.advance()
continue
statements.append(self.parse_statement())
if self.match(TokenType.DEDENT):
self.advance()
return statements
def parse_if(self) -> IfNode:
"""if 語句 = 'if' 表達式 ':' 區塊 ['else' ':' 區塊]"""
self.expect(TokenType.KEYWORD, 'if')
condition = self.parse_expression()
self.expect(TokenType.COLON)
self.expect(TokenType.NEWLINE)
self.expect(TokenType.INDENT)
then_body = self.parse_block()
else_body = []
if self.match(TokenType.KEYWORD) and self.current_token().value == 'else':
self.advance()
self.expect(TokenType.COLON)
self.expect(TokenType.NEWLINE)
self.expect(TokenType.INDENT)
else_body = self.parse_block()
return IfNode(condition, then_body, else_body)
def parse_while(self) -> WhileNode:
"""while 語句 = 'while' 表達式 ':' 區塊"""
self.expect(TokenType.KEYWORD, 'while')
condition = self.parse_expression()
self.expect(TokenType.COLON)
self.expect(TokenType.NEWLINE)
self.expect(TokenType.INDENT)
body = self.parse_block()
return WhileNode(condition, body)
def parse_return(self) -> ReturnNode:
"""return 語句 = 'return' [表達式]"""
self.expect(TokenType.KEYWORD, 'return')
value = None
if not self.match(TokenType.NEWLINE):
value = self.parse_expression()
if self.match(TokenType.NEWLINE):
self.advance()
return ReturnNode(value)
def parse_simple_statement(self) -> ASTNode:
"""簡單語句 = 賦值 | 表達式"""
# 檢查是否為賦值
if (self.match(TokenType.IDENTIFIER) and
self.peek(1) and self.peek(1).type == TokenType.ASSIGN):
name = self.advance().value
self.advance() # 跳過 '='
value = self.parse_expression()
if self.match(TokenType.NEWLINE):
self.advance()
return AssignNode(name, value)
# 否則是表達式語句
expr = self.parse_expression()
if self.match(TokenType.NEWLINE):
self.advance()
return expr
# ========== 表達式解析(運算符優先級) ==========
def parse_expression(self) -> ASTNode:
"""表達式 = 比較表達式"""
return self.parse_comparison()
def parse_comparison(self) -> ASTNode:
"""比較表達式 = 加法 (比較運算符 加法)*"""
left = self.parse_additive()
while self.current_token() and self.current_token().type in (
TokenType.LT, TokenType.GT, TokenType.LE, TokenType.GE,
TokenType.EQ, TokenType.NE
):
op = self.advance().value
right = self.parse_additive()
left = BinaryOpNode(op, left, right)
return left
def parse_additive(self) -> ASTNode:
"""加法表達式 = 乘法 (('+' | '-') 乘法)*"""
left = self.parse_multiplicative()
while self.current_token() and self.current_token().type in (
TokenType.PLUS, TokenType.MINUS
):
op = self.advance().value
right = self.parse_multiplicative()
left = BinaryOpNode(op, left, right)
return left
def parse_multiplicative(self) -> ASTNode:
"""乘法表達式 = 一元表達式 (('*' | '/') 一元表達式)*"""
left = self.parse_unary()
while self.current_token() and self.current_token().type in (
TokenType.STAR, TokenType.SLASH
):
op = self.advance().value
right = self.parse_unary()
left = BinaryOpNode(op, left, right)
return left
def parse_unary(self) -> ASTNode:
"""一元表達式 = ('+' | '-') 一元表達式 | 主表達式"""
if self.current_token() and self.current_token().type in (
TokenType.PLUS, TokenType.MINUS
):
op = self.advance().value
operand = self.parse_unary()
return UnaryOpNode(op, operand)
return self.parse_primary()
def parse_primary(self) -> ASTNode:
"""主表達式 = 數字 | 字串 | 識別符 | 函數調用 | '(' 表達式 ')'"""
token = self.current_token()
if token.type == TokenType.NUMBER:
self.advance()
return NumberNode(float(token.value))
if token.type == TokenType.STRING:
self.advance()
return StringNode(token.value)
if token.type == TokenType.IDENTIFIER:
name = self.advance().value
# 檢查是否為函數調用
if self.match(TokenType.LPAREN):
self.advance()
args = []
while not self.match(TokenType.RPAREN):
args.append(self.parse_expression())
if self.match(TokenType.COMMA):
self.advance()
self.expect(TokenType.RPAREN)
return FunctionCallNode(name, args)
return IdentifierNode(name)
if token.type == TokenType.KEYWORD:
if token.value == 'True':
self.advance()
return NumberNode(1)
elif token.value == 'False':
self.advance()
return NumberNode(0)
elif token.value == 'None':
self.advance()
return NumberNode(0)
if token.type == TokenType.LPAREN:
self.advance()
expr = self.parse_expression()
self.expect(TokenType.RPAREN)
return expr
raise SyntaxError(f"Unexpected token {token} at line {token.line}")
第四章:字節碼編譯
4.1 什麼是字節碼?
Python 字節碼是一種中間表示,介於源碼和機器碼之間。它是一系列虛擬機指令,由 Python 虛擬機解釋執行。
import dis
def add(a, b):
return a + b
dis.dis(add)
輸出:
2 0 LOAD_FAST 0 (a)
2 LOAD_FAST 1 (b)
4 BINARY_ADD
6 RETURN_VALUE
4.2 CPython 字節碼編譯流程
graph LR
AST[AST] --> SYM[符號表分析<br/>symtable]
SYM --> CFG[控制流圖<br/>CFG]
CFG --> BC[字節碼生成<br/>bytecode]
BC --> OPT[窺孔優化<br/>peephole]
OPT --> CODE[Code Object]
4.3 Python 字節碼指令
import dis
import opcode
# 查看所有操作碼
print("Python 字節碼操作碼:")
for name, code in sorted(opcode.opmap.items(), key=lambda x: x[1]):
print(f" {code:3d}: {name}")
常見指令分類:
# 1. 載入指令
LOAD_CONST # 載入常量
LOAD_FAST # 載入局部變量
LOAD_GLOBAL # 載入全局變量
LOAD_NAME # 載入名稱
LOAD_ATTR # 載入屬性
# 2. 存儲指令
STORE_FAST # 存儲局部變量
STORE_GLOBAL # 存儲全局變量
STORE_NAME # 存儲名稱
STORE_ATTR # 存儲屬性
# 3. 運算指令
BINARY_ADD # 加法
BINARY_SUBTRACT # 減法
BINARY_MULTIPLY # 乘法
BINARY_TRUE_DIVIDE # 除法
COMPARE_OP # 比較
# 4. 控制流指令
POP_JUMP_IF_FALSE # 條件跳轉
POP_JUMP_IF_TRUE
JUMP_FORWARD # 向前跳轉
JUMP_ABSOLUTE # 絕對跳轉
# 5. 函數指令
CALL_FUNCTION # 調用函數
RETURN_VALUE # 返回
MAKE_FUNCTION # 創建函數
# 6. 堆棧操作
POP_TOP # 彈出棧頂
DUP_TOP # 複製棧頂
ROT_TWO # 交換棧頂兩元素
4.4 Code Object 結構
def example(x, y):
z = x + y
return z * 2
code = example.__code__
print(f"函數名: {code.co_name}")
print(f"參數數量: {code.co_argcount}")
print(f"局部變量: {code.co_varnames}")
print(f"常量: {code.co_consts}")
print(f"名稱: {code.co_names}")
print(f"字節碼: {code.co_code.hex()}")
print(f"行號表: {code.co_lnotab.hex()}")
輸出:
函數名: example
參數數量: 2
局部變量: ('x', 'y', 'z')
常量: (None, 2)
名稱: ()
字節碼: 7c007c01170090005a02640153
行號表: 0001040106
4.5 CPython 編譯器源碼
// Python/compile.c
// AST 到字節碼的編譯
static int
compiler_visit_stmt(struct compiler *c, stmt_ty s)
{
switch (s->kind) {
case FunctionDef_kind:
return compiler_function(c, s);
case If_kind:
return compiler_if(c, s);
case While_kind:
return compiler_while(c, s);
case For_kind:
return compiler_for(c, s);
case Return_kind:
return compiler_return(c, s);
case Assign_kind:
return compiler_assign(c, s);
case Expr_kind:
return compiler_expr_stmt(c, s);
// ... 更多語句類型
}
}
// 編譯 if 語句
static int
compiler_if(struct compiler *c, stmt_ty s)
{
basicblock *end, *next;
end = compiler_new_block(c);
if (end == NULL) {
return 0;
}
// 編譯條件表達式
VISIT(c, expr, s->v.If.test);
// 如果有 else 分支
if (asdl_seq_LEN(s->v.If.orelse)) {
next = compiler_new_block(c);
if (next == NULL) {
return 0;
}
// 條件為假時跳轉到 else
ADDOP_JUMP(c, POP_JUMP_IF_FALSE, next);
// 編譯 then 分支
VISIT_SEQ(c, stmt, s->v.If.body);
// 跳過 else 分支
ADDOP_JUMP(c, JUMP_FORWARD, end);
// else 分支
compiler_use_next_block(c, next);
VISIT_SEQ(c, stmt, s->v.If.orelse);
}
else {
// 沒有 else,條件為假時跳到結尾
ADDOP_JUMP(c, POP_JUMP_IF_FALSE, end);
// 編譯 then 分支
VISIT_SEQ(c, stmt, s->v.If.body);
}
compiler_use_next_block(c, end);
return 1;
}
// 編譯二元運算
static int
compiler_boolop(struct compiler *c, expr_ty e)
{
basicblock *end;
Py_ssize_t i, n;
asdl_expr_seq *s;
end = compiler_new_block(c);
s = e->v.BoolOp.values;
n = asdl_seq_LEN(s);
for (i = 0; i < n - 1; i++) {
VISIT(c, expr, asdl_seq_GET(s, i));
// and: 短路求值,為假時跳轉
// or: 短路求值,為真時跳轉
if (e->v.BoolOp.op == And) {
ADDOP_JUMP(c, JUMP_IF_FALSE_OR_POP, end);
}
else {
ADDOP_JUMP(c, JUMP_IF_TRUE_OR_POP, end);
}
}
// 最後一個表達式
VISIT(c, expr, asdl_seq_GET(s, n - 1));
compiler_use_next_block(c, end);
return 1;
}
第五章:Python 虛擬機 (PVM)
5.1 虛擬機架構
graph TB
subgraph "Python Virtual Machine"
EVAL[求值迴路<br/>Evaluation Loop]
subgraph "Frame Stack"
F1[Frame 1<br/>main]
F2[Frame 2<br/>function]
F3[Frame 3<br/>nested]
end
subgraph "Each Frame"
STACK[Value Stack]
LOCALS[Local Variables]
GLOBALS[Global Variables]
CODE[Code Object]
end
EVAL --> F1
F1 --> F2
F2 --> F3
F3 --> STACK
F3 --> LOCALS
F3 --> GLOBALS
F3 --> CODE
end
5.2 求值迴路 (Evaluation Loop)
CPython 的核心是 Python/ceval.c 中的求值迴路:
// Python/ceval.c
// 主求值迴路 - 這是 Python 程式執行的核心
PyObject *
_PyEval_EvalFrameDefault(PyThreadState *tstate, PyFrameObject *f, int throwflag)
{
// 關鍵寄存器
PyObject **stack_pointer; // 堆棧指針
_Py_CODEUNIT *next_instr; // 指令指針
int opcode; // 當前操作碼
int oparg; // 操作數
// 快取常用對象
PyObject *names; // 名稱元組
PyObject *consts; // 常量元組
PyObject **fastlocals; // 快速局部變量
// 初始化
names = code->co_names;
consts = code->co_consts;
fastlocals = f->f_localsplus;
stack_pointer = f->f_valuestack;
next_instr = f->f_lasti + 1;
// 主迴路
for (;;) {
// 獲取下一條指令
opcode = _Py_OPCODE(*next_instr);
oparg = _Py_OPARG(*next_instr);
next_instr++;
// 巨大的 switch 語句處理每個操作碼
switch (opcode) {
case LOAD_FAST: {
// 載入局部變量到堆棧
PyObject *value = GETLOCAL(oparg);
if (value == NULL) {
goto unbound_local_error;
}
Py_INCREF(value);
PUSH(value);
DISPATCH();
}
case LOAD_CONST: {
// 載入常量到堆棧
PyObject *value = GETITEM(consts, oparg);
Py_INCREF(value);
PUSH(value);
DISPATCH();
}
case STORE_FAST: {
// 存儲堆棧頂到局部變量
PyObject *value = POP();
SETLOCAL(oparg, value);
DISPATCH();
}
case BINARY_ADD: {
// 加法運算
PyObject *right = POP();
PyObject *left = TOP();
PyObject *result = PyNumber_Add(left, right);
Py_DECREF(left);
Py_DECREF(right);
SET_TOP(result);
if (result == NULL) {
goto error;
}
DISPATCH();
}
case COMPARE_OP: {
// 比較運算
PyObject *right = POP();
PyObject *left = TOP();
PyObject *result = cmp_outcome(tstate, oparg, left, right);
Py_DECREF(left);
Py_DECREF(right);
SET_TOP(result);
if (result == NULL) {
goto error;
}
DISPATCH();
}
case POP_JUMP_IF_FALSE: {
// 條件跳轉
PyObject *cond = POP();
int is_true = PyObject_IsTrue(cond);
Py_DECREF(cond);
if (is_true < 0) {
goto error;
}
if (!is_true) {
JUMPTO(oparg);
}
DISPATCH();
}
case CALL_FUNCTION: {
// 函數調用
PyObject **sp = stack_pointer;
PyObject *result = call_function(tstate, &sp, oparg);
stack_pointer = sp;
PUSH(result);
if (result == NULL) {
goto error;
}
DISPATCH();
}
case RETURN_VALUE: {
// 函數返回
retval = POP();
goto exit_returning;
}
// ... 更多操作碼 (約 100+ 個)
}
}
exit_returning:
return retval;
error:
// 錯誤處理
return NULL;
}
5.3 堆棧機 vs 寄存器機
graph TB
subgraph "堆棧機 (CPython)"
S1[LOAD a]
S2[LOAD b]
S3[ADD]
S4[STORE c]
S1 --> S2 --> S3 --> S4
STACK1["Stack: []"]
STACK2["Stack: [a]"]
STACK3["Stack: [a, b]"]
STACK4["Stack: [a+b]"]
STACK5["Stack: []"]
end
subgraph "寄存器機 (Lua)"
R1["ADD R3, R1, R2"]
R2_["MOVE R4, R3"]
end
Python 使用堆棧機架構:
- 優點:指令簡單,編譯器實現容易
- 缺點:需要更多指令,更多記憶體操作
5.4 實作簡單的虛擬機
from enum import IntEnum
from dataclasses import dataclass
from typing import Any
class OpCode(IntEnum):
# 載入/存儲
LOAD_CONST = 1
LOAD_VAR = 2
STORE_VAR = 3
# 運算
ADD = 10
SUB = 11
MUL = 12
DIV = 13
NEG = 14
# 比較
LT = 20
GT = 21
LE = 22
GE = 23
EQ = 24
NE = 25
# 控制流
JUMP = 30
JUMP_IF_FALSE = 31
JUMP_IF_TRUE = 32
# 函數
CALL = 40
RETURN = 41
# 其他
PRINT = 50
POP = 51
DUP = 52
HALT = 99
@dataclass
class Instruction:
opcode: OpCode
arg: Any = None
class VirtualMachine:
def __init__(self):
self.stack: list = []
self.variables: dict = {}
self.functions: dict = {}
self.call_stack: list = [] # (return_address, saved_variables)
self.ip: int = 0 # 指令指針
def push(self, value):
self.stack.append(value)
def pop(self):
if not self.stack:
raise RuntimeError("Stack underflow")
return self.stack.pop()
def top(self):
if not self.stack:
raise RuntimeError("Stack is empty")
return self.stack[-1]
def execute(self, instructions: list[Instruction]):
self.ip = 0
while self.ip < len(instructions):
instr = instructions[self.ip]
self.ip += 1
match instr.opcode:
case OpCode.LOAD_CONST:
self.push(instr.arg)
case OpCode.LOAD_VAR:
name = instr.arg
if name in self.variables:
self.push(self.variables[name])
else:
raise RuntimeError(f"Undefined variable: {name}")
case OpCode.STORE_VAR:
name = instr.arg
self.variables[name] = self.pop()
case OpCode.ADD:
b = self.pop()
a = self.pop()
self.push(a + b)
case OpCode.SUB:
b = self.pop()
a = self.pop()
self.push(a - b)
case OpCode.MUL:
b = self.pop()
a = self.pop()
self.push(a * b)
case OpCode.DIV:
b = self.pop()
a = self.pop()
self.push(a / b)
case OpCode.NEG:
a = self.pop()
self.push(-a)
case OpCode.LT:
b = self.pop()
a = self.pop()
self.push(1 if a < b else 0)
case OpCode.GT:
b = self.pop()
a = self.pop()
self.push(1 if a > b else 0)
case OpCode.LE:
b = self.pop()
a = self.pop()
self.push(1 if a <= b else 0)
case OpCode.GE:
b = self.pop()
a = self.pop()
self.push(1 if a >= b else 0)
case OpCode.EQ:
b = self.pop()
a = self.pop()
self.push(1 if a == b else 0)
case OpCode.NE:
b = self.pop()
a = self.pop()
self.push(1 if a != b else 0)
case OpCode.JUMP:
self.ip = instr.arg
case OpCode.JUMP_IF_FALSE:
condition = self.pop()
if not condition:
self.ip = instr.arg
case OpCode.JUMP_IF_TRUE:
condition = self.pop()
if condition:
self.ip = instr.arg
case OpCode.CALL:
func_name, arg_count = instr.arg
if func_name not in self.functions:
raise RuntimeError(f"Undefined function: {func_name}")
func_addr, param_names = self.functions[func_name]
# 保存調用現場
saved_vars = self.variables.copy()
self.call_stack.append((self.ip, saved_vars))
# 綁定參數
args = [self.pop() for _ in range(arg_count)][::-1]
self.variables = {}
for name, value in zip(param_names, args):
self.variables[name] = value
# 跳轉到函數
self.ip = func_addr
case OpCode.RETURN:
return_value = self.pop() if self.stack else None
if self.call_stack:
self.ip, self.variables = self.call_stack.pop()
if return_value is not None:
self.push(return_value)
else:
return return_value
case OpCode.PRINT:
value = self.pop()
print(value)
case OpCode.POP:
self.pop()
case OpCode.DUP:
self.push(self.top())
case OpCode.HALT:
break
return None
# ========== 編譯器:AST 到字節碼 ==========
class Compiler:
def __init__(self):
self.instructions: list[Instruction] = []
self.functions: dict = {} # name -> (address, params)
def emit(self, opcode: OpCode, arg=None) -> int:
addr = len(self.instructions)
self.instructions.append(Instruction(opcode, arg))
return addr
def compile(self, nodes: list[ASTNode]) -> list[Instruction]:
# 第一遍:收集函數定義
for node in nodes:
if isinstance(node, FunctionDefNode):
self.compile_function_def(node)
# 第二遍:編譯主程式
main_start = len(self.instructions)
for node in nodes:
if not isinstance(node, FunctionDefNode):
self.compile_node(node)
self.emit(OpCode.HALT)
return self.instructions, self.functions, main_start
def compile_function_def(self, node: FunctionDefNode):
func_addr = len(self.instructions)
self.functions[node.name] = (func_addr, node.params)
for stmt in node.body:
self.compile_node(stmt)
# 確保函數有返回值
self.emit(OpCode.LOAD_CONST, None)
self.emit(OpCode.RETURN)
def compile_node(self, node: ASTNode):
match node:
case NumberNode(value):
self.emit(OpCode.LOAD_CONST, value)
case StringNode(value):
self.emit(OpCode.LOAD_CONST, value)
case IdentifierNode(name):
self.emit(OpCode.LOAD_VAR, name)
case BinaryOpNode(op, left, right):
self.compile_node(left)
self.compile_node(right)
op_map = {
'+': OpCode.ADD,
'-': OpCode.SUB,
'*': OpCode.MUL,
'/': OpCode.DIV,
'<': OpCode.LT,
'>': OpCode.GT,
'<=': OpCode.LE,
'>=': OpCode.GE,
'==': OpCode.EQ,
'!=': OpCode.NE,
}
self.emit(op_map[op])
case UnaryOpNode(op, operand):
self.compile_node(operand)
if op == '-':
self.emit(OpCode.NEG)
case AssignNode(target, value):
self.compile_node(value)
self.emit(OpCode.STORE_VAR, target)
case IfNode(condition, then_body, else_body):
self.compile_node(condition)
if else_body:
# if-else
else_jump = self.emit(OpCode.JUMP_IF_FALSE, None)
for stmt in then_body:
self.compile_node(stmt)
end_jump = self.emit(OpCode.JUMP, None)
# else 分支
else_addr = len(self.instructions)
self.instructions[else_jump].arg = else_addr
for stmt in else_body:
self.compile_node(stmt)
end_addr = len(self.instructions)
self.instructions[end_jump].arg = end_addr
else:
# if only
end_jump = self.emit(OpCode.JUMP_IF_FALSE, None)
for stmt in then_body:
self.compile_node(stmt)
end_addr = len(self.instructions)
self.instructions[end_jump].arg = end_addr
case WhileNode(condition, body):
loop_start = len(self.instructions)
self.compile_node(condition)
end_jump = self.emit(OpCode.JUMP_IF_FALSE, None)
for stmt in body:
self.compile_node(stmt)
self.emit(OpCode.JUMP, loop_start)
end_addr = len(self.instructions)
self.instructions[end_jump].arg = end_addr
case FunctionCallNode(name, args):
for arg in args:
self.compile_node(arg)
self.emit(OpCode.CALL, (name, len(args)))
case ReturnNode(value):
if value:
self.compile_node(value)
else:
self.emit(OpCode.LOAD_CONST, None)
self.emit(OpCode.RETURN)
case PrintNode(value):
self.compile_node(value)
self.emit(OpCode.PRINT)
# ========== 完整解釋器 ==========
class MiniPythonInterpreter:
def __init__(self):
self.vm = VirtualMachine()
def run(self, source: str):
# 1. 詞法分析
tokenizer = Tokenizer(source)
tokens = list(tokenizer.tokenize())
# 2. 語法分析
parser = Parser(tokens)
ast = parser.parse()
# 3. 編譯
compiler = Compiler()
instructions, functions, main_start = compiler.compile(ast)
# 4. 執行
self.vm.functions = functions
self.vm.ip = main_start
# Debug: 打印字節碼
print("=== Bytecode ===")
for i, instr in enumerate(instructions):
print(f"{i:4d}: {instr.opcode.name:20} {instr.arg}")
print("================\n")
return self.vm.execute(instructions)
# 測試
if __name__ == "__main__":
code = '''
def factorial(n):
if n <= 1:
return 1
return n * factorial(n - 1)
result = factorial(5)
print(result)
'''
interpreter = MiniPythonInterpreter()
interpreter.run(code)
第六章:Python 對象模型
6.1 一切皆對象
在 Python 中,所有東西都是對象,包括數字、字串、函數、類別,甚至是類型本身。
graph TB
subgraph "Python 對象層次"
OBJ[object]
TYPE[type]
OBJ --> TYPE
TYPE --> TYPE
INT[int]
STR[str]
LIST[list]
FUNC[function]
OBJ --> INT
OBJ --> STR
OBJ --> LIST
OBJ --> FUNC
TYPE --> INT
TYPE --> STR
TYPE --> LIST
TYPE --> FUNC
end
6.2 PyObject 結構
// Include/object.h
// 所有 Python 對象的基礎結構
typedef struct _object {
Py_ssize_t ob_refcnt; // 引用計數
PyTypeObject *ob_type; // 類型指針
} PyObject;
// 變長對象
typedef struct {
PyObject ob_base;
Py_ssize_t ob_size; // 元素數量
} PyVarObject;
// 整數對象 (小整數池化)
struct _longobject {
PyObject_VAR_HEAD
digit ob_digit[1]; // 數字陣列
};
// 字串對象
typedef struct {
PyObject_VAR_HEAD
Py_hash_t hash; // 快取的 hash 值
struct {
unsigned int interned:2;
unsigned int kind:3;
unsigned int compact:1;
unsigned int ascii:1;
unsigned int ready:1;
} state;
wchar_t *wstr;
} PyASCIIObject;
// 列表對象
typedef struct {
PyObject_VAR_HEAD
PyObject **ob_item; // 元素陣列
Py_ssize_t allocated; // 分配的空間
} PyListObject;
// 字典對象
typedef struct {
PyObject_HEAD
Py_ssize_t ma_used; // 使用的條目數
uint64_t ma_version_tag; // 版本標籤
PyDictKeysObject *ma_keys;
PyObject **ma_values;
} PyDictObject;
6.3 類型對象
// Include/cpython/object.h
typedef struct _typeobject {
PyObject_VAR_HEAD
const char *tp_name; // 類型名稱
Py_ssize_t tp_basicsize; // 基本大小
Py_ssize_t tp_itemsize; // 每個元素大小
// 特殊方法槽
destructor tp_dealloc; // __del__
reprfunc tp_repr; // __repr__
PyNumberMethods *tp_as_number; // 數值方法
PySequenceMethods *tp_as_sequence; // 序列方法
PyMappingMethods *tp_as_mapping; // 映射方法
hashfunc tp_hash; // __hash__
ternaryfunc tp_call; // __call__
reprfunc tp_str; // __str__
getattrofunc tp_getattro; // __getattr__
setattrofunc tp_setattro; // __setattr__
richcmpfunc tp_richcompare; // __eq__, __lt__, 等
Py_ssize_t tp_weaklistoffset; // 弱引用
getiterfunc tp_iter; // __iter__
iternextfunc tp_iternext; // __next__
struct PyMethodDef *tp_methods; // 方法列表
struct PyMemberDef *tp_members; // 成員列表
struct PyGetSetDef *tp_getset; // 屬性
PyTypeObject *tp_base; // 基類
PyObject *tp_dict; // 類型字典
descrgetfunc tp_descr_get; // 描述符 get
descrsetfunc tp_descr_set; // 描述符 set
initproc tp_init; // __init__
allocfunc tp_alloc; // 分配
newfunc tp_new; // __new__
freefunc tp_free; // 釋放
} PyTypeObject;
6.4 方法解析順序 (MRO)
class A:
def method(self):
return "A"
class B(A):
def method(self):
return "B"
class C(A):
def method(self):
return "C"
class D(B, C):
pass
# MRO 使用 C3 線性化算法
print(D.__mro__)
# (<class 'D'>, <class 'B'>, <class 'C'>, <class 'A'>, <class 'object'>)
d = D()
print(d.method()) # 輸出: B
graph BT
D --> B
D --> C
B --> A
C --> A
A --> object
subgraph "MRO 順序"
M1[1. D] --> M2[2. B]
M2 --> M3[3. C]
M3 --> M4[4. A]
M4 --> M5[5. object]
end
第七章:記憶體管理
7.1 引用計數
// Include/object.h
#define Py_INCREF(op) (((PyObject *)(op))->ob_refcnt++)
#define Py_DECREF(op) \
do { \
PyObject *_py_decref_tmp = (PyObject *)(op); \
if (--(_py_decref_tmp->ob_refcnt) == 0) { \
_Py_Dealloc(_py_decref_tmp); \
} \
} while (0)
import sys
a = [1, 2, 3]
print(sys.getrefcount(a)) # 2 (a 和 getrefcount 的參數)
b = a
print(sys.getrefcount(a)) # 3
del b
print(sys.getrefcount(a)) # 2
7.2 垃圾回收 (循環引用)
graph LR
subgraph "循環引用問題"
A[Object A<br/>refcount=1]
B[Object B<br/>refcount=1]
A --> |引用| B
B --> |引用| A
end
subgraph "標記清除"
G1[GC 根] --> A2[A<br/>可達]
A2 --> B2[B<br/>可達]
C2[C<br/>不可達] --> D2[D<br/>不可達]
D2 --> C2
end
// Modules/gcmodule.c
// 分代垃圾回收
#define NUM_GENERATIONS 3
struct gc_generation {
PyGC_Head head;
int threshold; // 觸發閾值
int count; // 當前對象數
};
static struct gc_generation generations[NUM_GENERATIONS] = {
/* 第 0 代 - 新對象 */
{{(uintptr_t)_GEN_HEAD(0), (uintptr_t)_GEN_HEAD(0)}, 700, 0},
/* 第 1 代 */
{{(uintptr_t)_GEN_HEAD(1), (uintptr_t)_GEN_HEAD(1)}, 10, 0},
/* 第 2 代 - 長壽對象 */
{{(uintptr_t)_GEN_HEAD(2), (uintptr_t)_GEN_HEAD(2)}, 10, 0},
};
static Py_ssize_t
gc_collect_main(PyThreadState *tstate, int generation)
{
// 1. 將所有可追蹤對象的 gc_refs 設為 ob_refcnt
update_refs(young);
// 2. 遍歷對象,減少被引用對象的 gc_refs
subtract_refs(young);
// 3. gc_refs > 0 的對象是可達的,移動到不可回收列表
gc_list_init(&unreachable);
move_unreachable(young, &unreachable);
// 4. 釋放不可達對象
delete_garbage(&unreachable);
return n;
}
7.3 記憶體分配器
graph TB
subgraph "Python 記憶體架構"
L3[Python Object Allocator<br/>對象分配器]
L2[Python Memory Allocator<br/>pymalloc]
L1[System Allocator<br/>malloc/free]
L0[OS Memory<br/>mmap/brk]
L3 --> L2
L2 --> L1
L1 --> L0
end
subgraph "pymalloc 結構"
ARENA[Arena 256KB]
POOL1[Pool 4KB]
POOL2[Pool 4KB]
BLOCK1[Block 8B]
BLOCK2[Block 16B]
ARENA --> POOL1
ARENA --> POOL2
POOL1 --> BLOCK1
POOL2 --> BLOCK2
end
第八章:進階主題
8.1 描述符協議
class Descriptor:
def __get__(self, obj, objtype=None):
print(f"Getting from {obj}")
return obj._value
def __set__(self, obj, value):
print(f"Setting {value} on {obj}")
obj._value = value
class MyClass:
attr = Descriptor()
def __init__(self, value):
self.attr = value
obj = MyClass(42)
print(obj.attr)
8.2 元類 (Metaclass)
class SingletonMeta(type):
_instances = {}
def __call__(cls, *args, **kwargs):
if cls not in cls._instances:
cls._instances[cls] = super().__call__(*args, **kwargs)
return cls._instances[cls]
class Database(metaclass=SingletonMeta):
def __init__(self):
print("Creating database connection")
# 只會創建一次
db1 = Database()
db2 = Database()
print(db1 is db2) # True
8.3 生成器與協程
import dis
def generator_example():
yield 1
yield 2
yield 3
# 查看生成器字節碼
dis.dis(generator_example)
# 生成器使用 YIELD_VALUE 指令
# 0 LOAD_CONST 1 (1)
# 2 YIELD_VALUE
# 4 POP_TOP
# 6 LOAD_CONST 2 (2)
# 8 YIELD_VALUE
# 10 POP_TOP
# 12 LOAD_CONST 3 (3)
# 14 YIELD_VALUE
# 16 POP_TOP
# 18 LOAD_CONST 0 (None)
# 20 RETURN_VALUE
總結
創造一門程式語言涉及多個階段:
graph LR
A[設計語法] --> B[實現詞法分析器]
B --> C[實現語法分析器]
C --> D[構建 AST]
D --> E[語義分析]
E --> F[生成中間碼/字節碼]
F --> G[實現虛擬機/編譯器]
G --> H[記憶體管理]
H --> I[標準庫]
Python 的成功在於:
- 簡潔的語法:縮進而非大括號
- 動態類型:靈活但有成本
- 豐富的對象模型:一切皆對象
- 自動記憶體管理:引用計數 + 分代 GC
- 優秀的生態系統:標準庫和第三方庫
理解這些底層機制,將幫助你:
- 寫出更高效的 Python 代碼
- 更好地調試和優化程式
- 有能力為 Python 貢獻代碼
- 設計和實現自己的程式語言