2019-06-25

遞迴下降分析器的簡介

語法樹
文法
遞迴下降解析
實做
迴圈定義的文法

以下內容僅為非資訊專業者所寫的心得，不保證內容正確性。
（最近更新：2019-07-22）

語法樹

就算沒有學過編譯器的人我想很多都知道，編譯器是把一種語言的程式翻譯成另一種語言者的工具。在轉成其他語言之前，需要將這個語言轉換成一棵樹狀結構 (抽象語法樹，abstract syntax tree)，來分析被編譯的程式內容是什麼。大概就像我們在翻譯一門語言的內容前，可以先將它轉換成語法樹。

比如說華語「我有筆」可以被剖析成以下的語法樹：

「我有筆」的自然語言語法樹。

我們藉由這樣的語法樹，就可以知道這句話的主詞、受詞、述語及其相對關係，對於理解句子的意思頗有幫助。同樣的，程式語言可以將其中的內容，藉由文法分析來得到其抽象語法樹，得到這段程式碼所要表達的意思，進而做後續的編譯或是直譯處理：

比如說 3 + 5 * (12 - 7) 可以解析為：

3 + 5 * (12 - 7) 的自然語言語法樹。

如果有學 Lisp 系的人 Scheme 的可以發現，就算 LISP 可以被解讀為 Lots of Insanely Strange Parenthesis 等等，但這種使用大量巢狀括號的方法 (S-expression) 可以高度一致、精簡而優雅的表達樹形結構，當然也可以用來表達 AST，比如上圖的 AST 的 S-expression 為 (+ 3 (* 5 (- 12 7)))。

文法

文法這裡可以理解為各種不可分割的程式片段，可以是變數名、字元等等，它們的組合規則。比如說一個中文子集合的文法可以定義如下（使用擴充BNF 文法表示）：

$句 = 主語 , 謂語$
$主語 = 名詞 | 代詞$
$代詞 = “我” | “你” | “他”$
$謂語 = 動詞 , 受詞$
$謂語 = “看” | “摸” | “吃”$
$受詞 = 名詞 | 代詞$
$名詞 = “冰” | “飯” | “菜”$

用這樣的形式，可以組合成「我吃冰」、「你看菜」、「飯看我」、「飯看冰」……等等句子。其中「我」、「你」、「他」、「看」、「菜」等等都是終端符號，其他是非終端符號；左手邊的符號只有一個項，不需要旁邊鄰接其他項（上下文）區別文法規則，所以這是上下文無關語法。

那我們定義文法，要如何用語法剖析抽象語法樹呢？假設有以下的運算表示文法（{…}表示括弧內的項目可重複）：

$$
\begin{align}

& expr = term, {“+”, term} \

& term = factor, {“*”, factor} \

& factor = “0” | “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8” | “9” \

\end{align}
$$

要如何套用到運算式進行分析呢？假設我們有式子$3 * 5 + 1$。我們可以用第一條規則($expr = term, {“+”, term}$)切分為：

term       +  term
3 * 5      +  1

則我們可以剖析局部 AST 為：(+ [to-be-parsed 3 * 5] 1)。然後term = 3 * 5可以用第二條語法規則$term = factor, {“*”, factor}$剖析為：

factor    *   factor
3         *   5

因為 3 和 5 可以是 factor 的值，所以這個文法樹可以順利的剖析為(+ (* 3 5) 1)。如果亂剖析成 $term = 3 * 5 + 1$，然後調用下一條語法規則剖析為$左factor = 3$，$右 factor = 5 + 1$，那你會無法繼續剖析成功。

注意包含優先項比較低的運算子（如本例的 +）的文法規則寫在比較前面。假設我們不寫在前面好了，變成以下這樣的文法：

$$
\begin{align}

& expr = factor {“*”, factor} \

& factor = term {“+”, term} \

& term = “0” | “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8” | “9” \

\end{align}
$$

那$3 * 5 + 1$假設拆開為：

factor *    factor
3      *    5 + 1

則右邊的 factor（=5 + 1）可以拆開為：

term  +   term
5     +   10

但是這樣的 AST 就會是(* 3 (+ 5 1))。不合我們要求

另外如果我們忽略之前的分析法，亂拆成：$factor = 3 * 5 + 1$，則

term    +   term
3 * 5   +   1

那左邊的 term（=3 * 5）不能被分拆，無法得到想要的結果。所以優先順序比較低的文法規則寫在前面。

因此我們可以手動比對語法規則，並且剖析出正確的語法樹。遞迴下降分析法的分析方式，大體上也是相似的。

遞迴下降解析

遞歸下降分析 (Recursive Descent Parsing) 是一種 LL 分析器。因為其輸入的 token，從左邊 (Left) 逐一讀進去分析（LL 的第一個 L），然後從左邊逐一解析出正確的構造樹是什麼（LL 的第二個 L），所以叫做 LL 分析器。雖然能處理的文法比較狹隘，但是因為構造除錯簡單，易於手動實做了解問題，許多分析器仍然使用該技術。

還是不知道運作流程？在我們實做出一個基本的遞歸下降分析器前，我們先做一個人肉操作示範。考慮這組文法：

$$
\begin{align}

& expr = term, {“+”, term}……(1) \

& term = factor, {“*”, factor}……(2) \

& factor = “0” | “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8” | “9”……(3) \

\end{align}
$$

指定一些變數：

$input\underline{\smash{ }} token = 3 + 5 * 7 $

$current\underline{\smash{ }} token = NULL$

$next\underline{\smash{ }} token = NULL$

$result = NULL$

我們令 $current\underline{\smash{ }} token = 3$，$next\underline{\smash{ }} token = +$，調到文法(1)，我們認為 $current\underline{\smash{ }} token$ 可能符合 term，所以調用文法(2)，然後同理調用文法(3)。最後我們發現等同於符號 “3”。

但是我們可以想，$current\underline{\smash{ }} token$ 和 $next\underline{\smash{ }} token$ 中間又有什麼關係？我們調回去文法(2)，令 $factor = 3$。結果因為該文法規定下一個 token 是 ，初判不符合 $term = factor, ““, term$ 這個文法規則，所以令 $term = factor$，回推到文法(1)。

這時我們看到，因為 $next\underline{\smash{ }} token = +$，所以整串應該符合 $expr = term, “+”, term$ 這個形式，結果就如：

result = TERM + ???
         3    + ???

，因為右邊還有一部份的解析樹尚未完成，我們再用 $term$ 來解析右邊的 token 串。此時

$current\underline{\smash{ }} token = 5$

$next\underline{\smash{ }} token = *$

我們依次帶入文法(2)、文法(3)後，知道 $factor = 5$，再回溯到上一層語法，這時候和 $next\underline{\smash{ }} token$ 比較，符合$term = factor, “*”, term$ 形式。

所以這時候往前讀 token，導致 $current\underline{\smash{ }} token = 7$，$next\underline{\smash{ }} token = None$（因為沒有剩下來的了）。此時的 result（括號是輔助區分結合關係的）：

result = TERM + (FACTOR * ???)
         3    +  5      * ???

。

我們把 $current\underline{\smash{ }} token = 7$ 代入 $term$ 往下解析再回推，最後我們可以得到底下的樹：

result = TERM + (FACTOR * TERM)
         3    +  5      * 7

。

result 整體符合 expr 的文法標準，可以傳回剖析樹。

實做

有了以上的概念，我們就可以實做簡單的遞迴下降分析器。根據上面的文法，以及操作流程，我們可以以 Python 寫下如下的 code：

#!/usr/bin/env python3
'''
A simple Recursive Descent Parser. 
GRAMMA RULES:
expr=term,{“+",term}
term=factor,{“∗",factor}
factor=“0"|“1"|“2"|“3"|“4"|“5"|“6"|“7"|“8"|“9"
Author: Yoxem <yoxem.tem98[ag]nctu.edu.tw>
License: Public Domain
'''

token_list = ["3", "+", "5", "*", "7"] # the tokens to be parsed.

current_token_index = 0

next_token_index = current_token_index + 1

"""moving forward to the next token"""
def move_forward():
    global token_list, current_token_index, next_token_index # use the outer variable
    current_token_index += 1
    next_token_index += 1

    """If current_token_index indicates the last token in the list,
    set next_token_index to None."""
    if current_token_index == len(token_list) - 1:
        next_token_index = None

"""expr = term,{“+",term}"""
def expr():
    global token_list, current_token_index
    result = term() # go to lower grammar rule

    """using loops to save the recursivily-defined grammar.
    If the next token can't be gotten, don't enter the loop"""
    while next_token_index != None and token_list[next_token_index] == "+":
        # if the result is a list, set the left term (result) as the left item(sub-tree).
        left = result

        move_forward()
        middle = token_list[current_token_index]
        move_forward()
        right = term()

        """arrange the syntax tree (in S-exp like form)"""
        result = [middle, left, right]
    
    return result

"""term = factor,{“∗",factor}"""
def term():
    global token_list, current_token_index
    result = factor()

    while next_token_index != None and token_list[next_token_index] == "*":
        left = result 
        move_forward()
        middle = token_list[current_token_index]
        move_forward()
        right = term()

        """arrange the syntax tree (in S-exp like form)"""
        result = [middle, left, right]
    
    return result

'''factor = “0"|“1"|“2"|“3"|“4"|“5"|“6"|“7"|“8"|“9"'''
def factor():
    global token_list, current_token_index
    item = int(token_list[current_token_index])

    # if it matches the pattern, return the number in integer type, or raise a error.
    if item in range(0,10): # pattern = 0, 1, 2, ..., 9
        return item
    else:
        raise Exception("Except integer from 0 to 9, given %d" % item)

# Let's parsing!
parsed_SExp_result = expr()
print(parsed_SExp_result) # Show the result

這個簡單的遞迴下降分析器，輸出的結果是形似 S-expression 的巢狀 list：

['+', 3, ['*', 5, 7]]

我們利用 current_token_index 來追蹤目前讀取的 token 是第幾個，next_token_index 指示下一個 token 是第幾個。move_forward 來前進到下一個，以更新目前 token 和下一個 token。

至於那些以文法規則左式命名的函數，它的運作機理大致如下：先讀取左邊（就是 current_token_index 尚未移位的時候）的 token，傳遞到下層文法規則的函數。然後檢查是終端符號（不能夠用其他符號指代的符號，此例為 factor 中的 0、1、……、9）後，再送到上層文法規則的函數。之後，如果某上層文法規則的函數檢查到下一個 token 符合該函數代表的文法規則，則調用右式最右邊符號為名的函數，處理分析剩下的 token，最後將分析好的，符合該層語法的 AST 左、中、右項組成 S-expression 樹，最後送到上層文法規則的函數處理剩下還未解析的部份，或是丟出結果。

最後要提：這裡先不要管迴圈，之後再提。

迴圈定義的文法

常常有時候文法需要以迴圈定義的。如 $expr = term, { “+” , term }$ 中，$“+” , term $ 可以出現不只一次。遇到這種情況，有許多處理方法，這裡使用迴圈處理。

我們看到 code 第 29-48行：

"""expr=term,{“+",term}"""
def expr():
    global token_list, current_token_index
    result = term() # go to lower grammar rule

    """using loops to save the recursivily-defined grammar.
    If the next token can't be gotten, don't enter the loop"""
    while next_token_index != None and token_list[next_token_index] == "+":
        # if the result is a list, set the left term (result) as the left item(sub-tree).
        left = result

        move_forward()
        middle = token_list[current_token_index]
        move_forward()
        right = term()

        """arrange the syntax tree (in S-exp like form)"""
        result = [middle, left, right]
    
    return result

這裡我們看見有三種情況：

如果下一組 token 不符合 +，或是下一組 token 不存在，則我們不進入 while 迴圈，直接回傳 result。
如果之後的 tokens 是 + term或+ term [非+token]*，則我們回傳的就是一組 S-expression，那進入迴圈之前讀 result 就是 S-expression 的 left 部份，中間部份是 +，後面部份是之後呼叫的 term 的回傳值。之後我們把這 3 個值組合成 list 存入至 result。因為下一組 token 不是 +，所以可以直接返回 result。
如果之後的 tokens 是 + term (+ term)*，則我們進入迴圈，第一次算入三個值存入到 result 後，因為之後的 token 還是 +，所以這個 list result 變成左式（符合我們對左結合的要求，如果是右結合需要另外用其他方式撰寫），再求出新的 result。最終 result 形如 [ + [ … [ + term term] … ] term]。

另外我們要注意的是，一文法規則右式的第一個符號，不應該和該文法規則左式的符號相同。要不然會無窮呼叫名為文法規則左式符號的函數，跳不出來。

北郊獨語 Yoxem's Blog

溪埔的石仔、山嶺的果樹之間，version 3.0

遞迴下降分析器的簡介

語法樹

文法

遞迴下降解析

實做

迴圈定義的文法