Concepts

Introduction

We understand that the main function of a compiler is to convert source code from one programming language to another, i.e., to map from the source language to the target language.

In implementing Formula, the principle is similar to that of a compiler, but the process does not involve stages such as generating intermediate code, code optimization, etc.

Main Process


Principle of implementation

Lexical analysis stage

Process


Participle

Tokens are the basic lexical units in the Formula, and Formula implementations divide them into five main categories:

  1. Functions: defined in the Formula with a fixed meaning of the word, such as IF, SUM, SEARCH and other functions

  2. variables: the value of a cell on behalf of the grid, such as {header column}

  3. constants: currently contains strings and numbers, such as "result", 100, etc.

  4. operators: usually used to handle a variety of arithmetic, relational and logical, such as "+", ">", "&&" and so on

  5. Delimiters: usually used to standardize precedence or logical divisions, such as "(", ",", etc.

The list of interfaces is as follows:

enum TokenType {
  Call = 'Call',
  
  PureValue = 'PureValue',
  Value = 'Value',
  
  Number = 'Number',
  String = 'String',
  
  Comma = 'Comma',
  Blank = 'Blank',
  LeftParen = 'LeftParen',
  RightParen = 'RightParen',
  
  Or = 'Or',
  And = 'And',
  Less = 'Less',
  LessEqual = 'LessEqual',
  Greater = 'Greater',
  GreaterEqual = 'GreaterEqual',
  Equal = 'Equal',
  NotEqual = 'NotEqual',
  Add = 'Add',
  Minus = 'Minus',
  Times = 'Times',
  Mod = 'Mod',
  Div = 'Div',
  Not = 'Not',
  Concat = 'Concat',
  Unknown = 'Unknown',
}

Implementation method

Regular expressions are a powerful text matching tool that can be used to identify and extract specific text patterns, so in Formula's lexical parser, we use regular expressions to match and extract specific lexical units.

Based on the TokenType delineated above, it is easy to match the corresponding Token, and each Token finally obtained has the following properties:

interface Token {
  token: TokenType;
  index: number;
  value: string;
}

Grammar analysis stage

Participle


Grammar Analysis

The main work of grammar parsing is to parse the tokens into a grammar tree and exclude the duality according to the formulated rules, so that the final calculation can be performed according to the constructed grammar tree.

The specific syntax tree node types are as follows:

class AstNode {}

// 当遇到函数时,会生成 CallOperandNode
class CallOperandNode extends AstNode {}

// 当遇到数字时,会生成 NumberOperandNode
class NumberOperandNode extends AstNode {}

// 当遇到字符串运算符时,会生成 StringOperandNode
class StringOperandNode extends AstNode {}

// 当遇到一元运算符时,会生成 UnaryOperatorNode,类似 !{布尔列}
class UnaryOperatorNode extends AstNode {}

// 当遇到逻辑运算符时,会生成 BinaryOperatorNode,1 + 2
class BinaryOperatorNode extends AstNode {}

// 当遇到变量时,会生成 ValueOperandNode
class ValueOperandNode extends AstNode {}

With the above we can describe each syntax unit with a tree node type.

For example there is now a formula: SUM(1, 2) * 3, which finally constructs the syntax tree as follows:

  • Ast

    • CallOperandNode (SUM)

      • NumberOperandNode (1)

      • NumberOperandNode (2)

    • NumberOperandNode (3)

Implementation method

We can implement it using the algorithm of recursion and backtracking, and the following is the core pseudo-code:

parse () {
  const token = nextToken();

  switch (token.type) {
    // Process the variable logic, generate the ValueOperandNode, then get the nextToken and call parse again.
    case TokenType.Value: 
    // Process the function logic, generate the CallOperandNode, then get the nextToken and call parse again.
    case TokenType.Call: 
    // ...More Token Type
    default: // Throw Error
  }
}

Calculation stage

Participle


Implementation method

The computation phase is to recursively traverse the cached syntax tree and return the value of each level.

There are different traversal methods for different types of AstNode, so here is the description for CallOperandNode.

When calculating a CallOperandNode, the corresponding FunctionClass is taken out for parameter verification and the static method func is called to calculate the result.

There are five main types of FunctionClass:

class FormulaFunc {}

// Formulas such as FIND, SEARCH, etc. are inherited from TextFunc
class TextFunc extends FormulaFunc {}

// Formulas such as SUM, ABS, etc. are inherited from NumberFunc
class NumberFunc extends FormulaFunc {}

// Formulas such as IF, OR, etc. are inherited from LogicalFunc
class LogicalFunc extends FormulaFunc {}

// Formulas such as YEAR, MONTH, etc. are inherited from DateFunc
class DateFunc extends FormulaFunc {}

// Formulas such as COUNTA, COUNTIF, etc. are inherited from ArrayFunc
class ArrayFunc extends FormulaFunc {}

Based on the five base classes mentioned above, a variety of formula classes can be derived, and each of them implements the following methods:

class XXX extends XXXFunc {
  static override validateParams() {}
  
  static override getReturnType() {}
  
  static override func(params) {}
}