Algorithm

May 27, 2020 · View on GitHub

Let A=a11a12..a1k1,a21..aNkNA = a_{11}a_{12}..a_{1k_1},a_{21}..a_{Nk_N} and B=b11b12..b1l1,b21..bMlMB = b_{11}b_{12}..b_{1l_1},b_{21}..b_{Ml_M} be tokens of length N and M respectively. Each token AiA_i in AA and BjB_j in BB have length kik_i and ljl_j respectively. The alignment ALABAL_{AB} of AA to BB is such that jALAB,i=>BjAi\forall j \in AL_{AB,i} => B_j \cap A_i. (tst \cap s means t partially matches s.) For example, a=["f","o","o"],b=["fo","o"]=>ALAB=[[1],[1],[2]],ALBA=[[1,2],[3]]a=["f","o","o"], b=["fo","o"] => AL_{AB} = [[1],[1],[2]], AL_{BA} = [[1, 2], [3]]. The goal of this algorithm is to find such ALABAL_{AB} and ALBAAL_{BA}

  1. Normalize tokens in the unicode normalization form "NFKD", then lowercase all characters.
  2. Concatenate all tokens AA and BB to generate TATA and TBTB respectively
  3. Calculate shortest path on edit graph of TATA and TBTB
  4. Get character mapping CABC_{AB} and CBAC_{BA} from the edit graph
  5. Get ALABAL_{AB} and ALBAAL_{BA} from the character alignments CABC_{AB} and CBAC_{BA}

Details:

  1. Normalize tokens in the unicode normalization form "NFKD"

To compare the token positions, we must compare each characters in tokens. Because the two tokenizations may be partially different, we normalize them in "NFKD" and lowercase them first.

  1. Concatenate all tokens AA and BB to generate TATA and TBTB respectively

Before calculating the edit graph, we combine tokens into text. For example, if we have tokens ["Foo", "bar"], we concatenate them into one text Foobar.

  1. Calculate shortest path on edit graph from TATA and TBTB

We calculate the shortest path on edit graph from texts TATA and TBTB to get character map between them. The path can be calculated, for example, by Myers' algorighm

  1. Get character alignments CABC_{AB} and CBAC_{BA} from the edit graph

Let TAiTA_i and TBjTB_j be the i-th and j-th character in the text TATA and TBTB, respectively. CABC_{AB} is a mapping from TATA to TBTB such that CAB,i1CAB,i=jTAi=TAjC_{AB},i \neq -1 \land C_{AB,i} = j \Rightarrow TA_i = TA_j. For example, TA=f0o,TB=fbooTA = f0o, TB = fboo then CAB=[1,1,3],CBA=[1,1,3,1]C_{AB} = [1,-1,3], C_{BA} = [1,-1,3,-1]. We can calculate CABC_{AB} and CBAC_{BA} from the shortest path on the edit graph. If there exists diagonal edge (i1,j1)>(i,j)(i-1,j-1) -> (i, j) in the path, CAB,i=jC_{AB,i} = j and CBA,j=iC_{BA,j} = i. If there doesn't exist any diagonal edge to j(i,j)\forall j (i, j) then CAB,i=1C_{AB,i} = -1.

  1. Get ALABAL_{AB} and ALBAAL_{BA} from the character alignments CABC_{AB} and CBAC_{BA}

Now we can calculate the desired ALABAL_{AB} and ALBAAL_{BA} from the previous calculated CABC_{AB} and CBAC_{BA}.