BASIC

One of the areas that I mostly left alone was that of improvements to the BASIC interpreter. When we got RISC OS 4, we found a couple of applications which failed to work with the changes that had been made to the assembler. Acorn had added support for half-word operations (for example, 'LDRH'), which meant that existing assembler source could be misinterpreted - and it was, in the source built by those applications.

I changed the behaviour so that the instructions would only be enabled if the new 'OPT' flag was set (bit 4, I believe), indicating that the extended instructions would be used. This allowed the new features to be used, whilst retaining backwards compatibility. There were some other changes that I made later which did not affect the behaviour of the application under normal circumstances, but would improve the application in general.

For example, the 'SOUND' statement previously called SWI OS_Word 7, which was the interface used by the BBC to play sounds. These sorts of legacy interfaces need to be abandoned, and that interface was one of the ones that had been moved from the Kernel into the LegacyBBC module. The LegacyBBC module was being populated with all of the old interfaces which were obsolete and could eventually be completely removed. Obviously, callers like BASIC needed to be updated to use the modern calls.

In RISC OS 3.6, BASIC had gained the ability to use a mode specifier string as a parameter to the 'MODE' statement. However, rather than handling this itself, the code instead directly called the Wimp to set the mode - resulting in the desktop mode changing when you changed the mode from within a BASIC program using a string. This was one of the main reasons that I created the new SWI OS_ScreenMode calls, allowing BASIC to use mode strings without involving the WindowManager.

There were some small changes to the assembler to support negative numbers when passed to certain operations, but these did not affect any existing code - they would have failed previously, so it wasn't a real problem.

I did make a small improvement to the way in which errors were handled, which helped in the few cases where the application would otherwise end up locking the entire machine. As errors were being handled, the memory allocated to the application would be checked for validity. If the memory had been taken away (as would happen if the application had accidentally reduced its slot size too far) it would have corrupted the top of the stack, and made the error handler fail (recursively).

Rather than hang the machine in such cases, the error was instead changed to a return to the caller with an error, exiting the application. This was unclean, and files might be left open and other resources leaked, but it was better than the alternative. There was still an issue if the user accidentally reduced the space used by the application completely, as even the return address would be removed from memory. But that was left as an exercise for another day.

There was one change that was made very early on that made BASIC significantly better, without changing a single piece of functionality. Better for me, that is. The entire source came as unformatted assembler, with a single space indent where normally you would line up to the instruction or operand in the line above. It was incredibly hard to read, even with context colouring - and this is on top of the fact that BASIC was quite special in the first place.

I wrote a very simple piece of Perl that reformatted the code to make it more manageable. It made some of the comments in the code harder to read, but there were sufficiently few of these that this did not really hurt <smile>.

I say BASIC is quite special. It is a great bit of code. But if you want to maintain it, you are in for a little bit of a headache. For a start, it is all assembler, and it was written by Sophie Wilson, who is something of an expert when it comes to clever ARM code. Despite this being early Sophie code, it has still got a lot of cleverness in it; and a lot of the BBC.

Sophie wrote the original BBC BASIC, which was very impressive at the time, especially when you remember that the 6502 only had three 8-bit registers. Some of the legacy of the original BASIC makes its way into the ARM implementation - which isn't to say that it doesn't take advantage of the extra registers and space available to it. It makes pretty good use of its environment, but that legacy has imposed some structure on the implementation of the language.

For a start, we are not using APCS at all in BASIC. It didn't really exist as a calling standard back when BASIC was originally conceived, and in any case the code wasn't going to interwork with anything else, so it didn't need to use such a standard. It does have its own register allocations, though. If you were calling ARM routines directly, the environment had a number of defined registers which you should preserve if you wanted the code to work well. If you were to call back to BASIC they needed to be valid, and if you wanted any exceptions to be reported usefully, you needed to preserve them.

There are no general rules for function calls within BASIC. Some functions take their parameters in known registers, and others use different registers. There may not even be a single entry point to some of the routines - many of the routines had slight variants which were entered offset from the 'start' of the function, because this allowed other registers to be used in place of the ones initialised earlier in the function entry.

There was no distinction between the labels which were within part of the entry sequence and those that were at the start of the function entry, but that was fine because there were never any comments that described the function's entry sequence anyhow. To Sophie, these were obvious, I guess. 'MUNGLE', 'AEEXPR' and 'FNINSTANT' are all obvious names when you know what they do.

I am sure that Steve Drain, who has probably spent more hours staring at BASIC code than most other people, would agree that they are simple and obvious. Steve wrote the very impressive extension 'BasAlt', which added a lot more functionality to BASIC - other people had patched in their own modifications, but Steve went farther than anyone else with new keywords, data types and other extensions.

BASIC doesn't use the stack like many other applications. The stack contains the program execution context first and foremost, and holds parameters for the implementation only secondarily. Essentially this means that as the program is executed, there is very little on the stack which pertains to the registers preserved between calls. Most of the stack contains the program state - as you enter a 'REPEAT' loop, a token (which happens to be 'UNTIL') and a position are pushed on to the stack. When an 'UNTIL' is reached, the stack is checked for a corresponding 'UNTIL' token; if there is such a token, then the position to return to is known. If not, then there is a mistake in the code, because you cannot use 'UNTIL' anywhere other than within a 'REPEAT' loop - and you will get an error reported.
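The REPEAT and UNTIL handling can be modelled in a few lines. This is only a sketch in Python, not a transcription of the ARM code: the names ControlStack, enter_repeat and reach_until are invented for illustration, and the error text is a placeholder.

```python
# Toy model of the control stack: REPEAT pushes an 'UNTIL' token plus the
# position to loop back to, and UNTIL checks for that token.
# All names here are hypothetical, chosen for the illustration only.
UNTIL_TOKEN = "UNTIL"


class BasicError(Exception):
    """Stands in for an error reported to the BASIC program."""


class ControlStack:
    def __init__(self):
        self._items = []  # each entry is a (token, position) pair

    def enter_repeat(self, position):
        # Record what we expect to close the loop, and where the loop body starts.
        self._items.append((UNTIL_TOKEN, position))

    def reach_until(self, condition):
        # UNTIL is only valid if the top of the stack holds a matching token.
        if not self._items or self._items[-1][0] != UNTIL_TOKEN:
            raise BasicError("UNTIL used outside a REPEAT loop")
        _, position = self._items[-1]
        if condition:
            self._items.pop()   # loop complete, discard the context
            return None         # caller carries on after the UNTIL
        return position         # caller jumps back to the loop body
```

A caller would treat a returned position as "jump back here" and None as "fall through to the next statement".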

Variables are held in linked lists, hashed by the first letter of the variable name. The names are stored in the list without the first character, as this is implied by the list that they are in. Localising a variable causes a copy of the variable's value to be preserved on the stack, and new storage to be created for the variable. As functions are unwound, these preserved values can be restored.
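To make that layout concrete, here is a small Python model. It is a sketch under assumptions: Python lists stand in for the linked lists and for the BASIC stack, and the names VariableStore, make_local, mark and unwind are hypothetical rather than anything in the BASIC source.

```python
# Toy model of variable storage: one list per first character, with the
# first character dropped from the stored name. Names are illustrative only.
class VariableStore:
    def __init__(self):
        self.lists = {}  # first character -> list of [rest_of_name, value]
        self.saved = []  # stands in for the stack: (name, old_value) pairs

    def _find(self, name):
        first, rest = name[0], name[1:]
        for entry in self.lists.get(first, []):
            if entry[0] == rest:
                return entry
        return None

    def set(self, name, value):
        entry = self._find(name)
        if entry is None:
            # New variables are linked in at the head of their list.
            self.lists.setdefault(name[0], []).insert(0, [name[1:], value])
        else:
            entry[1] = value

    def get(self, name):
        entry = self._find(name)
        if entry is None:
            raise KeyError(name)
        return entry[1]

    def make_local(self, name, initial=0):
        # Preserve the current value on the 'stack', then start afresh.
        entry = self._find(name)
        if entry is None:
            self.set(name, initial)
            entry = self._find(name)
        self.saved.append((name, entry[1]))
        entry[1] = initial

    def mark(self):
        # Remember how much has been saved, like recording a stack pointer.
        return len(self.saved)

    def unwind(self, mark):
        # Restore preserved values, most recent first, back to the mark.
        while len(self.saved) > mark:
            name, old_value = self.saved.pop()
            self.set(name, old_value)
```

The mark and unwind pair is the interesting part: anything that records a stack position can later restore every variable localised since that point.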

Error handling is implemented by recording the stack pointer position at the point at which the local error handler was installed (which is why the position of the error handler in the procedure or function entry was important). Triggering the error handler would perform an explicit restore of all the variables which had been preserved on the stack up to that stack pointer. If the stack pointer did not refer to a location that was understood, an error would be raised (which may explain the somewhat cryptic error message "Attempt to use badly nested error handler (or corrupt R13)").

Procedure and function locations are cached when they are first called. Their location, and a reference to the first argument, are stored as logical variables (for symbols that cannot be variable names). This means that subsequent calls can look up the location quickly.

Function calls are also amusing, because the entry of the function call enters the function code, and when the '= value' is encountered it is evaluated. Then control returns to exactly where the function call was made but with the returned value ready for processing by the expression parser. It is really obvious, and it is quite elegant.

There is also a code locality cache which keeps track of recent results of searches to make them even faster. For example, the result of a search for the end of a 'CASE' statement would be held in a hashed cache location. If it was needed again soon after, it would be able to reuse the previous result. It could get overwritten by other cache results, which would just mean that if it was needed again a full search would be used.
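The behaviour described here is that of a small direct-mapped cache. The Python below is a sketch of that general technique, not of BASIC's actual hashing: the slot count, the use of the statement position as the key, and the names LocalityCache and lookup are all assumptions made for the example.

```python
# Toy direct-mapped cache: each key hashes to exactly one slot, and a newer
# result simply overwrites whatever was there. Names are illustrative only.
class LocalityCache:
    def __init__(self, slots=8):
        self.slots = [None] * slots
        self.full_searches = 0  # counts how often we had to search properly

    def lookup(self, position, full_search):
        index = position % len(self.slots)
        entry = self.slots[index]
        if entry is not None and entry[0] == position:
            return entry[1]  # recent result still present, reuse it
        # Miss (never cached, or overwritten): do the slow search and remember it.
        self.full_searches += 1
        result = full_search(position)
        self.slots[index] = (position, result)
        return result
```

Losing an entry costs nothing but time, which is why no effort is needed to protect entries from being overwritten.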

It is all so very fun, because nothing about it is pedestrian. Every little bit of code is different, and you have to keep your wits about you when you are playing with it.

C linkage

Why, when I didn't add any substantial features to BASIC, do I know all this (and many other little things that are far less amusing)? I had a play with it recently. I wanted to prove to myself that I wasn't too rusty at ARM code, and could do something 'useful' by just charging at it. I say 'charging at it', when really I mean that I had spent a week of walks in to work deciding what I wanted, and how I would like to achieve it, interspersed with a few dips into the code in the evening to get my head around what the code did and why. So, in that respect it was a reasonably planned exercise.

I didn't really want to do anything too impressive; I just wanted to make a small change to the way that functions and procedures were called. It had been an oft-asked question on the newsgroups: 'how do I use C code from BASIC?'. Maybe not that oft asked, but enough that I wrote a lengthy Usenet posting about it, some time back.

My change was to add the ability to call C code directly from BASIC. Functions or procedures could be called just as you would normally, but instead of (for example) locating a function in the BASIC program, 'PROCfoo(bar$)' would locate the C function called 'PROCfoo', marshal the bar$ variable into the called parameter block, and execute it. Functions would return values by just returning the BASIC value reference, and this would be handled just like a function return.

It is not too hard to add an object format loader to BASIC - written in C - but it introduces some excitement in terms of the way that the environment is handled. As mentioned above, there are certain registers that expect to be preserved when executing BASIC code otherwise it won't be able to report errors properly. These conflict with APCS, which is used by C code. So there end up being 3 distinct contexts that the code may execute in at the point that the error handler is called.

Firstly, there is the plain BASIC execution context, with all the variables in the place BASIC expects them to be. Then there is the APCS context, entered for one of our utility functions - as part of the implementation of the C object linkage code. And finally there is the APCS context of the called functions in the object files themselves. The last two are similar, although the mechanism by which the expected BASIC registers are recovered differs slightly.

There is a little excitement with the handling of the stack as well. As mentioned, the stack contains the BASIC context, and that is not affected by the C code, but anywhere that the C code is invoked knows what the stack limit is - this is the base of the stack. Initially, the stack limit will be the limit of the allocated variables when the C code is first entered. However, the variables might change - an expression evaluated by the C code may call functions that cause variables to be allocated. Or the C code may call a BASIC procedure directly, which might allocate variables (or even the cache entry for that procedure).

This means that the stack limit will rise. Obviously we need to check, as we return, that there is enough space left on the stack for the APCS calling convention - and generate an error if not. But what might not be obvious is that the caller will have preserved its view of the stack limit, and so returning to the caller function would seemingly lower the stack limit below the actual limit.

There are two ways to deal with this. One is to reserve a small region of the stack when the C routine is first called, and use the stack limit checking calls to provide a stack extension below the current allocation. Essentially, this means using small stack chunks so that the application never truly reaches the variable limit intentionally (and if it does, we have run out of space and can report it).

This method is wasteful, as the base of the stack always reserves a set amount of space for its own use, and this would be wasted on every extension. Additionally, any large stack uses (for example, large structures on the stack) would probably exceed the small allocation, causing all the remaining space to be wasted and the stack to have to extend far more.

The other method, which I chose, was to unwind the stack contexts, fixing up the stack limit as we go, if and only if the stack limit has changed. This makes the limit take effect in the earlier calls which logically would not need that extra space (if they did, the current function would never have been able to be called).
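The fix-up can be pictured with a toy model. This Python sketch is an illustration under assumptions, not the real unwinder: it assumes a stack that grows downwards towards a limit that rises as variables are allocated, and the names CFrame, fix_up_limits and StackOverflow are invented for the example.

```python
# Toy model of fixing up saved stack limits in outstanding C frames.
# Assumes a descending stack: sp falls towards a limit that can only rise.
# All names are hypothetical.
class StackOverflow(Exception):
    """Raised when a frame no longer has room above the new limit."""


class CFrame:
    def __init__(self, name, sp, saved_limit):
        self.name = name
        self.sp = sp                    # this frame's stack pointer
        self.saved_limit = saved_limit  # the limit this frame believes in


def fix_up_limits(frames, old_limit, new_limit, reserve=0):
    """Update every outstanding frame, innermost first; return how many changed.

    frames is ordered outermost first. Nothing is touched if the limit has
    not moved, which is the common and cheap case.
    """
    if new_limit == old_limit:
        return 0
    fixed = 0
    for frame in reversed(frames):
        if frame.sp - reserve < new_limit:
            raise StackOverflow("no stack left for " + frame.name)
        frame.saved_limit = new_limit
        fixed += 1
    return fixed
```

The cost is proportional to the depth of nested C calls, and is paid each time the limit actually moves.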

It is kind of fun, but it works just fine. As you thread in and out of C and BASIC entry points, each entry needs to reassert its knowledge of the stack limit as APCS comes into force. If there is a deep nesting of C calls, and we are repeatedly calling BASIC functions which modify the variable limit, this can be time consuming. I could not see any easier way to implement the limit checks.

There are a bunch of support functions provided, some from the C library, and some specialised for BASIC. For example, in many cases it is important to treat strings like BASIC code, so having 'strcpy' isn't what you want. So there are functions that take BASIC format strings, and produce C format strings, and vice-versa. There are standard C functions that work just like you would expect - I implemented 'printf' which could take the normal format strings and would print out the message to the standard output. As well as the normal formats there were also formats for BASIC variable types, so that they could be output.
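The string conversion is worth a small illustration, because it shows why 'strcpy' is not enough. The sketch below is in Python and rests on an assumption not stated above: that a BASIC string is held as a length plus bytes with no terminator, whereas a C string is NUL-terminated. The function names basic_to_c and c_to_basic are hypothetical.

```python
# Toy conversion between a length-counted string (assumed BASIC format)
# and a NUL-terminated one (C format). Names are illustrative only.
def c_to_basic(c_bytes):
    """Return (length, body) for the text before the first NUL byte."""
    end = c_bytes.find(b"\0")
    if end < 0:
        raise ValueError("C string is not NUL-terminated")
    body = c_bytes[:end]
    return len(body), body


def basic_to_c(length, body):
    """Return the first 'length' bytes of body with a NUL terminator added."""
    text = body[:length]
    if b"\0" in text:
        # A counted string can hold a zero byte; a C string cannot.
        raise ValueError("embedded NUL cannot be represented as a C string")
    return text + b"\0"
```

The embedded zero byte is the awkward case: it is perfectly legal in a counted string, but would silently truncate the text if handed to ordinary C string routines.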

The BASIC variable printing types would honour @% so that they would be formatted with the correct number of decimal places (unless you specified a number), and the whole output itself would honour the setting of 'WIDTH', and correctly count up characters such that 'POS' was still correct.

At the same time, I introduced BASIC to the debugging libraries that I had written to be used in all the assembler code. This made things a world easier to debug, as full formatting was available, and I never needed to care about the state of the registers other than that there was some stack available.

The function look up in the C object code was performed in such a manner that it reused the existing procedure and function look up tables, and the locality cache entries were also used where appropriate. It was pretty fast even without any special optimisations. The stack manipulation for the C entry sequence was somewhat amusing as it had to include all the parameters that were being passed to the routine, but this just made the whole process a little more challenging.

The result was pretty unexciting, though. You would load the object code with a simple 'LIBRARY "object-filename"'. It would be identified as being either a Code or Data type file and the contents checked before loading. Once loaded, the object remained in memory, much as a regular LIBRARY would.

From there you could call PROCcfunction or FNcfunction with or without parameters and the relevant symbol would be looked up and called. That's really about it.

There is a whole world of other things that BASIC could do with. I had begun implementing longer strings, but these would invariably break applications which were expecting the strings to be limited to 256 characters. I've already mentioned that BASIC could have done with some better handling of Toolbox objects - something that Steve Drain had already tackled with some success in BasAlt.

Objects

Matthew Godbolt's !IRBasic was a completely independent implementation of BASIC which only supported integers, but did have a pretty advanced object system, together with garbage collection. It showed that you could provide objects in BASIC, albeit by some odd hacks. Aside from the issue of implementing objects, there is the equally important question of how they are named and used.

Floating point variables have no suffix; strings have a $ suffix; integers have a % suffix. What would make the most sense for objects, and would not be incorrectly identified in existing code? It might be nice to use ^, to align with Pascal, except this is the exponentiation operator. There is the & symbol, which is only used for hexadecimal notation. This seems like a candidate, but it is valid to use an alphanumeric sequence ending in an ampersand followed immediately by something else - var&deaf is a valid sequence in a PRINT statement, and is interpreted as if there was a comma separating the var and &deaf.

Still, this is quite an obscure case, and the ampersand could be thought of as similar to a C reference. Whether members were accessed with a separating period (as is common in most languages) or not is a similarly interesting question. This might result in a syntax like items = list&.length, and as a function, value = FNlist&.pop. Should the FN/PROC notation even be retained for method calls? In !IRBasic, routines were defined by prefixing the class name, for example DEFFNlist_pop.

The object reference itself ('this', or 'self' in other languages) was accessed in !IRBasic as @% (as this had little use in its implementation), and was never included in the parameter list for the called functions explicitly - it was implicitly set. This led to interesting notations such as PROC@%.add(items), which would call the PROCclass_add method on the current object.

In the same way, using @& as the 'this' would match the style, but look a bit awkward. There is no reason why the object reference couldn't be called this&, except it's not quite as succinct.

On the other hand, if '@' were selected as the object specifier you could have a syntax like 'FNlist@.add(item)', or even omit the dot entirely. The at sign used in this way is normally invalid, so it is quite safe to do this. This might mean that you have this@ as the object reference, or just plain @.

!IRBasic allowed an inherited function to invoke the same method in the inherited class through PROC@, which, if you were using the @ symbol as the type specifier, would look rather nice, and not interfere with anything.

There would be times when objects would need to be stored into data structures, or passed around. As objects are just pointers to related structures (and inheritance models), their representation would be most easily made as a single integer. Converting an object to an integer would be easiest achieved with the INT(object@) operation, although you might want to be more obvious and use the POS(object@) token to make it clear that it wasn't a plain conversion (POS used followed by brackets is an invalid sequence, so this would be safe).

Converting back to an object is less easy to decide on. Because any keyword or sequence that is used must not conflict with existing code, it is quite difficult to come up with an obvious use that would not affect otherwise valid code. I am reasonably tempted by using something like object@ = PTR(value), as the use of PTR without a following # is a syntax error.

This also leads to a possible use of type inspection, for example PTR$(object@) might return the name of the type of the object, following the rule that operations that return string results must end in a dollar.

Construction of an object is easy to define; the NEW keyword can be reused, for example NEW primes(5). Declaration of the object is more interesting, and I reckon something like DEF OF classname(args) would read quite nicely and could be followed by the initialiser code. Member variables could be defined within the definition, with LOCAL variables being instantiated with the class, but any variables which have global scope defined within the initialiser would actually be defined to only be visible to the class itself as static variables.

Destructors, such as you might have them, could reuse the DELETE keyword, as DEF DELETE classname. Deleting an object explicitly could be achieved with a simple DELETE object@.

All object operations would be handled by reference - if you wanted a copy operation, you would explicitly perform the copy (maybe by convention the objects would have a 'copy' method). This would be more efficient most of the time, as you really don't want to be creating lots of duplicate structures every time you pass an object to a function.

So you might end up with something looking like:

DEF OF list
  LOCAL length
  LOCAL datapointer

  REM Initialiser
  @.length = 0
  @.datapointer = 0
END OF

DEF PROClist_add(value)
  LOCAL newptr
  REM We'll invent a new syntax for memory allocation as well.
  IF @.datapointer = 0 THEN
    newptr = DIM(4)
  ELSE
    REM Resize the existing block to hold one more 4-byte entry.
    newptr = DIM(@.datapointer, (@.length+1)*4)
  ENDIF
  IF newptr = 0 THEN ERROR &1, "Could not add to list"
  newptr!(@.length*4) = value
  @.length += 1
  @.datapointer = newptr
ENDPROC

DEF DELETE list
  REM Invent a new syntax for freeing memory as well.
  DELETE @.datapointer
END DELETE
Prototype code for objects

Is that at all useful? No idea. It's nice to sketch out how you might use the language before you even think about implementing it. Is it worth it even? Probably not.

There are a few questions raised by the above example code, such as the difference between ENDPROC and END DELETE; the former is a single token, where the latter is two, separated by a space, to match the use of DEF. That space does confuse things a little as far as the parser goes. Without it, ENDDELETE could be a new token.

The use of ERROR to report problems implies that BASIC could probably do with some exception handling, and that is a whole other kettle of fish. Maybe something could be easily added to the error handler to allow exception objects.

And of course, there is my wild stab at a memory allocation syntax that might not be quite as messy as doing it all yourself by hand, as I have done in the past. Whatever happens, all this lot is going to break on earlier versions of BASIC, just as BASIC 5 programs broke when run on a BASIC 3 system. The important thing is to make them break, rather than work but do something odd. This is all pretty pie in the sky thinking, anyhow.