OCR reference language interpreter

For this project, I wanted to make an online interpreter for the OCR exam reference langauge. The spec for this is available here: https://www.ocr.org.uk/Images/558027-specification-gcse-computer-science-j277.pdf. This is an ongoing project which I actively maintain, so if you do have any questions or suggestions do feel free to contact me 🙂

To start with, I took most of the tokeniser from my old language project. This works by iterating through the input string character by character, and then when the character changes from a letter to a symbol for example, it knows to start a new token. There is also a special flag for strings. In the end we are left with a list of tokens that consist of a type, a value and a locator (I’ll get to this later). The list of token types is as follows:

  • Identifier (e.g. print)
  • Number (whole number, e.g. 5, – note that this doesn’t include negative numbers, for reasons I will talk about later)
  • Symbols (e.g. (, ., +)
  • String (anything surrounded by quotes, e.g. "Hello world!" – although the token value itself does not include the quotes)
  • Space (whitespace)
  • Float (a decmial number, e.g. 0.2 – again, note that this doesn’t include negative numbers)

The tokeniser function takes in the program string and then returns a list of these tokens. At this point, I take a differing approach to convention – I don’t create an AST. Since this language is interpreted, there is no need to – I can just look for certain tokens in a statement (e.g. if the statement starts with the word while then it is a while loop).

From here, the program begins parsing. We iterate through each token, and upon encountering a newline we execute the tokens we have collected so far and then move on to the next one. However, there are special cases, like if statements and for loops, which have multiple newlines as they contain multiple statements. The solution for this is to have a sort of bracket counter – for each if keyword encountered, increase the bracket count by one, and for each endif keyword encountered, decrease it by one. When the bracket count is 0 and a newline is encountered, only then can we start parsing the next statement. Once we have these single line statements (except statements with blocks in, but I’ll get to those later), we can begin execution.

Execution begins by trying to determine which statement is being used. In effect, it looks through a list of keywords which each have different execution routines associated. If it can’t find any of those, it looks for assignment (=) within the statement. Failing that, it will try to treat the statement as a function call, before finally giving up and throwing an error message.

However, this error message is not very useful, as we have no idea where it occurred. This is where the token locators come in! Each token has a locator which tells us exactly where it was positioned in the original string. In this case I do this by keeping track of the newline count in the tokeniser and simply adding a line number to each token. This way, when an error message is raised, we can pass a token locator in order to highlight exactly where the error occurred.

Token locators in action

Back to execution – at this point, statements containing more statements like if and for have their contents parsed again, and then they execute them conditionally or repeatedly depending on the statement. Having the parsing phase seperate from the execution phase gives a tiny performance increase, as the statement does not have to be parsed every time a loop is run.

Now, it’s on to the expression parser. Here is where it gets interesting – as expressions need to be converted into trees before being executed. This is difficult, as you might imagine, due to things like the ambiguity of brackets (function call or not?), precedence and unary operators. A lot of this was ported from my other project, although it was faulty. The basic concept is as follows.

First, we check if there is any symbol at the start of the expression, like NOT or -. These symbols will be unary operators, so we create a node that contains the operator and captures (and parses) the rest of the expression. This also captures the negative numbers I mentioned earlier: -5 is a calculation, not a literal.

After this, we iterate through the tokens looking for binary operators with the lowest precedence first, from right to left – contrary to what we are taught in school. So, to start with, the parser looks for + and - starting from the right side of the expression, and splits the expression into two halves when they are encountered, recursing and parsing again. This is why we use the operators with the least precedence most – because they are evaluated last, they actually capture the most. Take the example a * b + c * d – because the multiplication is evaluated first, we end up with (a * b) + (c * d). After iterating through, the process repeats for operators with higher precedence, until the whole expression is parsed.

You may be wondering how brackets work with this, but it is actually surprisingly simple – similar to the statement parser, we keep a count of how many brackets have been opened, which we increment upon an opening bracket and decrement on a closing one. Now, we just tell the parser to only split when the bracket count is 0 – effectively giving anything inside brackets the highest precedence as they are only evaluated once all other operators have been evaluated.

When it comes function calls, the expression parser does not touch their brackets as it is never told to evaluate them. We catch this behaviour at the end of the parsing function, where we do some basic pattern matching to detect if the remaining tokens are of the form identifier(...) (member functions use a similar technique). We then parse each of the functions parameters, split by commas, and return an object representing the function and its parameters.

In the end, we should end up with a tree-like structure – e.g. for the expression (3 + a) * 2 / 4 - magic(b, 5), the following object should be generated:

Expression {
    type: "op",
    value: "-",
    lhs: {
        type: "op",
        value: "*",
        lhs: {
            type: "op",
            value: "+",
            lhs: {
                type: "literal",
                value: 3
            },
            rhs: {
                type: "var",
                value: "a"
            }
        },
        rhs: {
            type: "literal",
            value: 2
        }
    },
    rhs: {
        type: "fun",
        value: "magic",
        args: [
            {
                type: "var",
                value: "b",
            },
            {
                type: "literal",
                value: 5
            }
        ]
    }
}

This expression can now be executed by traversing the tree using the operations given.

There are also modules for scoping and variables, but that isn’t very interesting and works mostly as you would expect – although functions are first class objects (they can be assigned to variables and passed to other functions etc.), which is quite nice.

I also implemented a line by line debugging feature, which is also relatively simple – it pauses execution between each statement (although I did have to change every function to be async to do this), and skips over functions unless a step into flag is set. Step out uses a similar technique, and step over just continues execution until the next statement. There is also a trace table available, and making this was a simple case of getting all variables in scope.

After this, all that was left was to put the UI on top. HTML doesn’t allow colour inside a textarea, which is what I chose to use for the input. With the benefit of hindsight, a better choice may have been to use an editor like Ace, which I am fairly sure can be used with custom grammar. Anyway, the way it works is by copying the value of the input into a div, and performing a regex replacement on it. For example, this is the regex to highlight literals (except strings):

/\b(([0-9]+(\.[0-9]+)?)|true|false)\b/g

Note that this doesn’t highlight the - in front of numbers. I wanted to do this, but had an issue with Safari which for some reason doesn’t support lookbehind in regex!

As mentioned earlier, I actively maintain this, so please do get in touch if you have any ideas or problems 🙂

View project

2 thoughts on “OCR reference language interpreter

    1. Hi Mr Abbas,
      I wouldn’t mind at all! Do feel free to share, I would greatly appreciate it 🙂
      Thanks,
      John

Leave a Reply

Your email address will not be published. Required fields are marked *