Rule Parser for NLU


Objective:

Given an input text (an utterance), lists of words or regex patterns (entities), and rules defined over combinations of those entities (intents), match the utterance to the corresponding intent.


Motivation:

Though this sounds like a plain text-parsing problem, where we just need to do lookups and regex matches, think of a scenario where we have M lists of entities for lookup and N intents, each defined as a combination of those M entities with possibilities such as the following (one way to encode such a rule is sketched after the list):

Intent (I1):

  1. Must contain both entities E1 and E2
  2. May optionally contain E3
  3. Must contain either E4 or E5
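
One way to encode such a rule, purely as an illustrative sketch (the required / optional / one_of field names and the matches_intent helper are assumptions for this post, not part of any existing engine):

    # Illustrative rule encoding for intent I1 (field names are assumptions, not a fixed schema).
    I1 = {
        'name': 'I1',
        'required': ['E1', 'E2'],    # must all be present
        'optional': ['E3'],          # may be present; never blocks a match
        'one_of':   [('E4', 'E5')],  # at least one entity from each group is required
    }

    def matches_intent(rule, found_entities):
        """Check a set of matched entity names against a single intent rule."""
        found = set(found_entities)
        if not set(rule['required']).issubset(found):
            return False
        return all(any(e in found for e in group) for group in rule['one_of'])

    # matches_intent(I1, {'E1', 'E2', 'E5'}) -> True
    # matches_intent(I1, {'E1', 'E4'})       -> False (E2 is missing)

This keeps the N intent rules declarative; the harder part, which the rest of this post focuses on, is producing a clean, non-overlapping set of matched entities to feed into such a check.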

Approached naively, this becomes an algorithmically challenging problem with very high time complexity. Thus, we borrow some graph data structures to address these kinds of issues for us.


Techniques:

In this section we will discuss the following modules, which are the backbone of this Rule Parsing Engine.

  1. Entity: Let's define an entity (E1). E1 has a list of words and phrases, and given a word, we need to check whether it exists in that list. Notation: $E = \{e_1, e_2, e_3, \dots, e_l\}$.
    Pythonic way,

    # Tuple of entity name and corresponding list of words.
    E1 = ('E1', ['e1', 'e2', 'e3', 'e4'])
    

    To address this problem, we will use a data structure called a Trie and measure string similarity with edit distance.

    What is Edit Distance? Given two strings str1 and str2, find the minimum number of edits (operations) required to convert str1 into str2. The following operations are allowed, each with equal cost:

    1. Add
    2. Delete
    3. Substitute
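
    As a refresher, here is a minimal sketch of the classic dynamic-programming solution (the function name edit_distance is just for illustration):

    def edit_distance(str1, str2):
        """Minimum number of add/delete/substitute operations to turn str1 into str2."""
        m, n = len(str1), len(str2)
        # dp[i][j] = edit distance between str1[:i] and str2[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i                 # delete all i characters of str1
        for j in range(n + 1):
            dp[0][j] = j                 # add all j characters of str2
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if str1[i - 1] == str2[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                               dp[i][j - 1] + 1,         # add
                               dp[i - 1][j - 1] + cost)  # substitute (or match)
        return dp[m][n]

    # edit_distance("helloworl", "helloworld") -> 1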


    What is a Trie? A Trie (also known as a prefix tree) is a special type of tree used to store associative arrays whose keys are usually strings. A trie (pronounced “try”) gets its name from retrieval; its structure makes it a stellar fit for string matching.


    Now we will use both of these approaches to solve the above problem. Let’s see what we want to achieve with the data structure.

    trie = Trie(edit_distance_max=1)
    trie.insert("helloworld")

    # Both queries are within one edit of "helloworld", so both lookups match it.
    results = list(trie.lookup("helloworl"))   # missing trailing 'd'
    results = list(trie.lookup("elloworld"))   # missing leading 'h'
    

    Implementation
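
    Since the full implementation is not reproduced here, below is a minimal sketch of such a trie, assuming the Trie(edit_distance_max=...) / insert / lookup interface used above. It computes one row of the edit-distance table per trie node and prunes branches that can no longer match; it is an illustration, not the production implementation.

    class TrieNode:
        def __init__(self):
            self.children = {}   # character -> TrieNode
            self.word = None     # full word stored at terminal nodes

    class Trie:
        def __init__(self, edit_distance_max=1):
            self.root = TrieNode()
            self.edit_distance_max = edit_distance_max

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.word = word

        def lookup(self, query):
            """Yield (word, distance) for every stored word within edit_distance_max of query."""
            # Row 0 of the edit-distance table: empty trie prefix vs. prefixes of the query.
            first_row = list(range(len(query) + 1))
            for ch, child in self.root.children.items():
                yield from self._search(child, ch, query, first_row)

        def _search(self, node, ch, query, prev_row):
            # Build the edit-distance row for the trie prefix ending at this node.
            current_row = [prev_row[0] + 1]
            for i in range(1, len(query) + 1):
                add_cost = current_row[i - 1] + 1
                delete_cost = prev_row[i] + 1
                substitute_cost = prev_row[i - 1] + (0 if query[i - 1] == ch else 1)
                current_row.append(min(add_cost, delete_cost, substitute_cost))

            # The last cell is the distance between this prefix and the whole query.
            if node.word is not None and current_row[-1] <= self.edit_distance_max:
                yield (node.word, current_row[-1])

            # Prune: descend only if some cell could still lead to a close enough match.
            if min(current_row) <= self.edit_distance_max:
                for next_ch, child in node.children.items():
                    yield from self._search(child, next_ch, query, current_row)

    With this sketch and edit_distance_max=1, both lookups in the snippet above return [('helloworld', 1)].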

    Now that we have a lookup for entities, we want to keep only entities that do not overlap with their neighbours, i.e. given a list of tagged entities, expand out the valid parse results. A parse result is considered valid if it contains no overlapping spans. Since the total confidence of a parse result is the sum of the confidences of its entities, there is no sense in yielding any potential parse result that is a subset/subsequence of a larger valid parse result. Thus, we will leverage “Expander Graphs”.



Expander Graphs: From Entities to Valid Parses

After using our Trie to scan an utterance, we are left with a collection of potential entity matches. The problem is that these matches can be messy and overlapping. For instance, in the utterance “show me flights to New York City”, our Trie might identify:

  • E_DESTINATION: “New York City” (span: 19-32)
  • E_CITY: “New York” (span: 19-27)
  • E_CITY: “York” (span: 23-27)

We intuitively know that “New York City” is the best and most complete match. We need a systematic way to discard the shorter, overlapping entities (“New York”, “York”) and combine the remaining, non-overlapping ones into the most confident final parse. This is precisely where a graph-based approach shines.

While the term “Expander Graph” has deep roots in pure mathematics concerning sparse yet highly-connected graphs, in our context, we use it to describe a graph that helps us expand from a jumble of candidate entities to a single, valid, non-overlapping sequence. Let’s call it a Parse Graph.

How it Works

The core idea is to model the relationships between entity candidates as a Directed Acyclic Graph (DAG). We then find the “best” path through this graph, which will correspond to our final parse. The process involves three steps:

  1. Node Creation (Entity Filtering): First, we process the raw matches from the Trie. For any set of overlapping entities, we only keep the one with the longest span. This is based on the “Maximal Munch” principle, ensuring we select the most comprehensive match. Each of these filtered, longest-span entities becomes a node in our graph. Each node stores the entity’s details: its name, text span (start, end), and a confidence score (e.g., based on the edit distance).

  2. Edge Creation (Building the DAG): We connect the nodes to represent valid, sequential combinations. We draw a directed edge from a node $A$ to a node $B$ if and only if entity $B$ starts after entity $A$ ends in the utterance, so the two spans do not overlap; that is, if A.end < B.start. Because edges only point forward in the text, our graph has no cycles, making it a DAG.

  3. Path Finding (Dynamic Programming): Our goal is to find the sequence of entities (a path in the DAG) that has the highest total confidence. This is equivalent to finding the longest path in a DAG, where the “length” of a path is the sum of the confidence scores of its nodes. This can be solved efficiently with dynamic programming: $O(V+E)$ if nodes are processed in topological order, or $O(V^2)$ with a simple nested loop over nodes sorted by position, as sketched below.
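
Concretely, write $\text{best}(i)$ for the highest total confidence of any valid parse ending at entity $i$, with $\text{score}(i)$, $\text{start}_i$ and $\text{end}_i$ as above. The recurrence behind step 3 is

$\text{best}(i) = \text{score}(i) + \max\big(0,\ \max_{j \,:\, \text{end}_j < \text{start}_i} \text{best}(j)\big)$

and the best parse is recovered by backtracking from $\arg\max_i \text{best}(i)$. This is exactly the dynamic program implemented in the code at the end of this post.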

Example Walkthrough

Let’s trace the process with an utterance: “Book a table for two for tonight at 7pm”

  1. Trie Matching: Suppose our Trie finds the following potential entities:

    • E_ACTION: “Book a table” (span 0-12, score: 0.9)
    • E_PAX: “two” (span 17-20, score: 1.0)
    • E_DATE: “tonight” (span 25-32, score: 1.0)
    • E_TIME: “7pm” (span 36-39, score: 1.0)
    • E_DATETIME: “tonight at 7pm” (span 25-39, score: 0.95) -> Longest match
  2. Node Creation: We filter for the longest spans. “tonight” and “7pm” are part of the longer E_DATETIME entity, so we discard them and keep E_DATETIME. Our nodes are:

    • Node A: E_ACTION (span 0-12, score 0.9)
    • Node B: E_PAX (span 17-20, score 1.0)
    • Node C: E_DATETIME (span 25-39, score 0.95)
  3. Edge Creation: We connect every ordered, non-overlapping pair of nodes.

    • A.end (12) < B.start (17) -> Draw edge $A \rightarrow B$
    • B.end (20) < C.start (25) -> Draw edge $B \rightarrow C$
    • A.end (12) < C.start (25) -> Draw edge $A \rightarrow C$

    Our graph therefore contains the paths $A \rightarrow B \rightarrow C$ and $A \rightarrow C$.

  4. Path Finding: Since every score is positive, the path through all three nodes, $A \rightarrow B \rightarrow C$, has the highest total confidence. The final parse is the sequence of entities on this path:

    [E_ACTION, E_PAX, E_DATETIME]

    The total confidence is $0.9 + 1.0 + 0.95 = 2.85$.


Implementation Example

Let’s translate the theory into Python. The following code demonstrates how to filter overlapping entities and find the best parse using dynamic programming.

import pprint

def filter_overlapping_entities(entities):
    """
    Filters a list of entities to keep only the longest ones in case of overlaps.
    """
    # Greedy maximal-munch scan: walk the candidates in start order and, at each
    # start position, keep the longest one that does not overlap a kept entity.
    
    filtered = []
    last_end_pos = -1
    
    for entity in sorted(entities, key=lambda e: e['span'][0]):
        # Only consider entities that don't overlap with already selected ones
        if entity['span'][0] >= last_end_pos:
            # Among all entities starting at the same point, we need the longest.
            # Find all entities starting at the same position
            contenders = [e for e in entities if e['span'][0] == entity['span'][0]]
            
            # Select the one with the maximum end position (the longest one)
            best_contender = max(contenders, key=lambda e: e['span'][1])
            
            if best_contender not in filtered:
                filtered.append(best_contender)
                last_end_pos = best_contender['span'][1]

    return sorted(filtered, key=lambda e: e['span'][0])


def find_best_parse(entities):
    """
    Finds the path of non-overlapping entities with the maximum total score.
    This is a longest path in a DAG problem.
    """
    if not entities:
        return [], 0.0
        
    n = len(entities)
    # Assumes entities are sorted by start position
    # (as returned by filter_overlapping_entities).
    # scores[i] stores the max score of a path ending at entity i
    scores = [e['score'] for e in entities]
    # parents[i] stores the index of the predecessor of entity i in the best path
    parents = [-1] * n

    for i in range(n):
        for j in range(i):
            # Check for a valid, non-overlapping sequence
            if entities[j]['span'][1] < entities[i]['span'][0]:
                # If a path through j gives a better score for i, update it
                if scores[j] + entities[i]['score'] > scores[i]:
                    scores[i] = scores[j] + entities[i]['score']
                    parents[i] = j
    
    # Find the end of the best path (the one with the highest score)
    best_score = -1
    end_index = -1
    for i in range(n):
        if scores[i] > best_score:
            best_score = scores[i]
            end_index = i
            
    # Reconstruct the path by backtracking from the end node
    path = []
    while end_index != -1:
        path.append(entities[end_index])
        end_index = parents[end_index]
        
    # The path is reconstructed backwards, so we reverse it
    path.reverse()
    return path, best_score

# --- Dummy Data based on the walkthrough ---
# Utterance: "Book a table for two for tonight at 7pm"
raw_entities = [
    {'name': 'E_ACTION',   'text': 'Book a table',     'span': (0, 12), 'score': 0.9},
    {'name': 'E_PAX',      'text': 'two',              'span': (17, 20), 'score': 1.0},
    {'name': 'E_DATE',     'text': 'tonight',          'span': (25, 32), 'score': 1.0},
    {'name': 'E_TIME',     'text': '7pm',              'span': (36, 39), 'score': 1.0},
    {'name': 'E_DATETIME', 'text': 'tonight at 7pm',   'span': (25, 39), 'score': 0.95}
]

# 1. Filter the raw entities to remove overlaps
print("--- Step 1: Filtered Entities (Maximal Munch) ---")
filtered_entities = filter_overlapping_entities(raw_entities)
pprint.pprint(filtered_entities)
# Expected output: E_ACTION, E_PAX, E_DATETIME

# 2. Find the best parse from the filtered entities
print("\n--- Step 2: Find Best Parse (Longest Path) ---")
best_path, total_score = find_best_parse(filtered_entities)

print(f"\nBest Parse has a total score of: {total_score:.2f}")
print("Entities in Best Parse:")
pprint.pprint(best_path)