RISC OS has been, from the outset, a highly modular system. Not only is it provided in the form of modules with clearly defined interfaces, but developers are encouraged to extend these through vectors and other extensions.
My view has always been that the mechanism for extending the operating system is through those existing methods - modules to provide functionality and drivers, extended and claimable vectors for new features to allow for replaceable functionality, and (suitably understood) backwards compatibility in all APIs.
The vector extensions were generally known from the BBC series, and modules were somewhat akin to ROMs, so this was not a particularly difficult system to work in for myself (or most people who were upgrading). The vectors covered a lot more, though, and were chainable so that there could be multiple claimants (aside from the obvious benefits, I think this probably came about because of the pain that was produced by the single claimant filing vectors on the BBC, which meant that providing multiple concurrent filing systems was a bit awkward). The vectors and module extensions were used to great effect by people like Computer Concepts (GDraw and DitherExtend, for example).
Much of this modularity is a requirement of the user base that the OS has. Users are highly resistant to upgrading, because the system has been able to be modularly extensible (and also because that is just the way that they are), and so any upgrades must remain as backwards compatible as possible whilst not restricting the development of the whole.
RISC OS 3 provided a lot more functionality over RISC OS 2, sufficient to be called a new version. Many new components appeared, and whilst you could upgrade RISC OS 2 to RISC OS 3 features in some respects by loading later versions of the modules, there were some significant changes that meant that wasn't possible. This said, 3.0 was a rushed (in my opinion) release which was later upgraded to 3.1 - RISC OS 3.0 systems have some very real problems which cannot be fully mitigated by softloading components.
The reason that even replacing components doesn't give you a 'new version' is mostly due to the Kernel and the collusion between modules. Collusion is, in general, bad - particularly for interoperability. In some cases it is as simple as one module 'knowing' about parameters that are passed from one component to another in blocks, but which aren't documented. In others it's more invasive, such as direct access to private workspace used by a second component - the Kernel zero-page workspace is a treasure trove of variables that were poked and prodded from all sorts of places.
Is it sad that you remember that &FF8 is the current domain ID, used by the WindowManager, FileSwitch and others? Or that &104 is the escape flag, which is set under interrupts when a character would be inserted into the current input stream, and so is a 'quick' way to perform SWI OS_ReadEscapeState. Oh, and who can forget &10C - the system clock, for those times when calling SWI OS_ReadMonotonicTime is just too much effort. It all comes of spending a lot of time stripping these dependencies from the internal modules.
"Internal" (that is, OS supplied) modules should not know about the implementation of other modules (which is the main thing that I mean by 'collusion'), except where it's absolutely necessary to perform their primary goal. There are not many examples of that exception that you can really justify, but typical ones include knowing the workspace of a module that you're trying to affect in order to do something that wasn't possible before, or which was done badly - for example as a patch, or to provide compatibility with other versions of the module or Operating System. A few patch modules fall into this category. So do some Toolbox modules - in some cases they need to do things that weren't supported under earlier versions of the WindowManager, so having detected that this is necessary they will prod areas, or perform special gymnastics, to ensure that they work properly. Oh, and obviously no 'external' modules should do this sort of thing either.
The other main area of collusion is to have special interfaces between the modules which are not documented, or documented as 'internal use only'. It's fine to restrict access to certain types of interfaces which aren't yet stable, but restricting them unnecessarily is just plain wrong.
It's really not great, but it's a product of a modular system which tries to provide for everything that went before. Acorn, before we ever worked with RISC OS, had already decided that they had to keep RISC OS 3.1 support going. With Toolbox releases, despite Toolbox being a huge drain on such systems, we still tried to keep the modules working. There were also other contortions with Toolbox, as a group of releases of the Toolbox module itself would fail to shut down cleanly under some circumstances, introducing some interesting hoops to jump through to ensure that new modules were loaded safely.
I had seen some developers comment that the 'internal use' APIs should be publicly documented. With very few exceptions that's really the wrong solution. The 'internal use' APIs were, generally, marked as such because the implementation was subject to change, or just plain bad. The right solution is to determine what's needed, and why, and to design a good API to go along with it, rather than exposing what was known to be a bad interface and being tied to using it in the future.
Quite a bit of the work throughout the Select developments was to remove collusions and to make dependencies safe. It was, initially, not possible to reorder some modules in the ROM (*Unplug notwithstanding) because their dependants would fail to initialise if the modules they required were not present. This has so many problems - in particular it means that when testing you have to ensure that you kill everything off, and restart the modules in the correct order, in order to test that they will work in a ROM as expected. It also means that those modules that expect others to be present may get upset if those others are killed off without their knowledge.
In a lot of cases, the stacks of modules will have services to notify one another of their initialisation and finalisation, allowing registration and activation of the functions as necessary. The sound system is one of the most obvious places where this can be seen - kill off SoundDMA, then restart it, and you'll hear the sound stop, then restart as if nothing had happened (internal state configuration notwithstanding).
There's always going to be some state lost when modules are replaced, but during development you need to be able to replace parts of the system to ensure that your changes work. If the system crashes whilst you're debugging something because a component can't cope with being reloaded, that's pretty annoying.
Periodic resilience reviews
Thus it was that there were occasional bug hunts based on lobotomising the operating system. It sounds so much more important when you use the name 'Periodic resilience review', which is what it was, but you've got to get down to the crux of the work - which is really just killing modules until something breaks.
Usually the tests would result in strange crashes which shouldn't have happened, or I would notice that the section of code being reviewed really could not work without other modules being present, and took no action to prevent itself getting into a bad state. These tests usually ran along the lines of loading up a bunch of typical applications, covering a spread of functional uses (some sound, some graphical, some network, some written by Computer Concepts - who were pretty good at doing things a little out of the ordinary). Then kill off modules, or configure them in ways that weren't normal (or possibly even sensible).
Invariably, this showed up some issue in other modules, or the applications. Errors reporting the death of applications weren't too much of a worry. Death of the machine, or aborts, were issues, and would be noted down. Then the process would be repeated, killing more off to see what happened. It is surprising how much survived ColourTrans not being present. The sound system, as has been said, was improved to ensure that all the cases where you killed and restored modules were safe (a lot of that came from AMPlayer development and the many ways that the system could be taken down by getting something wrong).
Repeat until we find that the system becomes unusable. Usually the module that you've just killed was a little more important than you expected, and there was something to be done to tidy things up.
The other way of lobotomising the system was by replacing functionality with valid or semi-valid alternate functions. Making SWI OS_Byte calls return errors, or odd reports. Essentially, keeping to the documented API, but varying the response. Much of the time there was nothing that happened, but sometimes you learnt a little about what's safe to change and what's not.
Much of this is intensive, manual work. You can't do much else whilst you're in the process of doing it, so it has to be targeted at a particular problem case. Well, unless I was feeling bored and wanted a bit of a challenge.
Part of this process was encouraged by the BBC game 'Phantom Combat Flight Simulator'. It was pretty bad - wireframe world and a wireframe triangle of an interceptor trying to shoot you. But one thing that it did have was damage based on how you were shot. So you might shoot down the enemy, but end up with your engines only running very slightly above your stalling speed, or your wings might get clipped such that you're constantly banking to one side unless you hold the stick to counter it. Such damage made landing far, far more fun. So it is with lobotomising the OS.
I got quite frustrated with my own and many third party modules being unable to work properly in a modular manner. Many things would fail if modules weren't present, or they would do odd things. Additionally modules would use the wrong information to determine a feature set. To try to help people with this, I wrote a specific document about what people should aim for with their modules. It's always useful to do that for other people because it provides you with something you can look back on and cringe because you don't comply with it yourself. Mostly I did, but actually writing down what you're trying to achieve - and why - is always a good plan.
Anyhow, the reason for this lobotomised testing was to ensure that for development, and for users, the system was resilient in the face of failing components. They might be small corner cases that you find by doing this sort of testing, but additionally they might be a case that some developer is hitting regularly because of how they use the system. At the same time you get to investigate code that you might not otherwise have looked at - and that may help you to find other issues which were only visible by having examined the code. I certainly couldn't count the number of bugs that were found and addressed just by doing this sort of testing.
Ah, the memories... "The MessageTrans bug" was a real frustration for a lot of people. It presented itself as an error when anything tried to use MessageTrans, and reinitialising the module would not work - it had caused its internal workspace to be invalid and would abort during finalisation. It was known about for some time, and the 'fix' was to reinitialise the module, see the abort, then write 0 to the MessageTrans workspace, and reinitialise it again (I think - maybe it was a small offset from the workspace base). This would leak memory but leave the module functioning enough to restart, from there on it should be ok.
That sort of thing was a real issue for some people - particularly developers. The bug was that MessageTrans constructed a chained descriptor on the SVC stack under some circumstances, and when this was done in a TaskWindow which then swapped out the SVC stack the chain became broken and could not be recovered from. Usually it happened if you closed a TaskWindow whilst it was in the middle of a compilation, or pressing Escape at just the wrong moment within the TaskWindow. It made using TaskWindows something of a game of Russian roulette.
Whilst that couldn't be addressed specifically by the modularisation, the requirement should be that finalisation should never fail. If it does, it should still leave the module in a state that it can be killed by a second attempt. Usually that means marking lists as freed before unlinking them, and similar things.
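One way to make finalisation idempotent is sketched below in C. The structures here are hypothetical stand-ins, not the real MessageTrans workspace: each node is detached from the chain before it is freed, and the workspace is flagged as dead last, so that a second finalisation attempt - even after a crash part-way through - finds a consistent state and nothing left to do.

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct node { struct node *next; } node;

/* Hypothetical module workspace: a chain of claims plus a validity flag. */
typedef struct { node *claims; int valid; } workspace;

/* Finalise so that a second attempt is always safe: detach each node
   from the chain *before* freeing it, so an abort mid-way still leaves
   a well-formed (shorter) list rather than a dangling pointer. */
void finalise(workspace *ws)
{
    if (!ws->valid) return;          /* already torn down: nothing to do */
    while (ws->claims != NULL) {
        node *dead = ws->claims;
        ws->claims = dead->next;     /* unlink first... */
        free(dead);                  /* ...then release the memory */
    }
    ws->valid = 0;                   /* mark the workspace as dead */
}
```

The ordering is the whole point: if the free were done before the unlink, a failure in between would leave the list head pointing at freed memory, which is exactly the kind of state that made the MessageTrans bug unrecoverable.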
As I've said above, some modules wouldn't start unless the modules they depended upon were present. In a lot of cases I managed to update modules so that if the functionality they required was not present, the module would become quiescent. They could then wake up when the required modules were started - or just try using them later when they were needed.
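The quiescent-module pattern can be sketched in C (the names and plain-function shape are illustrative assumptions - a real RISC OS module would do this through its initialisation entry and service call handler): the module never refuses to initialise, it merely records that its dependency is absent and waits for a notification to wake up.

```c
#include <stdbool.h>

/* Hypothetical module state: whether the dependency exists, and whether
   we are actively providing our service. */
typedef struct {
    bool dependency_present;
    bool active;
} module_state;

module_state state;

/* Initialisation entry: never fail just because a dependency is absent;
   start dormant instead. */
void module_init(bool dependency_present)
{
    state.dependency_present = dependency_present;
    state.active = dependency_present;   /* dormant if dependency missing */
}

/* Handler for a (hypothetical) service announcing the dependency started. */
void service_dependency_started(void)
{
    state.dependency_present = true;
    state.active = true;                 /* wake up now it is available */
}

/* Handler for the dependency going away: go back to sleep, stay loaded. */
void service_dependency_dying(void)
{
    state.dependency_present = false;
    state.active = false;
}
```

The design choice is that presence of a dependency changes behaviour, not whether the module can exist - which is what makes arbitrary kill-and-restart testing survivable.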
In some places the existing interfaces are just too entangled to remove. The keyboard drivers were one of these cases. Not only is there magic knowledge about the tables that the Kernel passes to the keyboards, the documentation that does exist is very poor. The sound system is similar (quite a few people have said to me that they couldn't make head nor tail of the sound documentation), especially the description of the private blocks for sound channels and gating. But the keyboard handler is a bit of a nightmare.
Kernel module extraction
My view, as I've said quite a few times to developers in public forums, is that RISC OS is modular, and there are some things that hurt that modularity. Having lots of 'core' functionality in the Kernel is one of the ways that it hurts - and one that I set out to address. There are a few reasons for this beyond those that you apply to general modularity in the rest of the Operating System. Mainly there's the legacy issue within the Kernel - it's old, it's been played with by quite a few people and it colludes heavily with many other components (and with parts of itself when it ought not to). There's also the issue that in many cases the code just hasn't been looked at, and understood, by anyone in a long time.
One of the other problems with the Kernel is that it has areas of code that seem like they should be self contained, but in reality they have special knowledge about other code. They either directly access sections of the Kernel that shouldn't be related, or poke at variables you might not expect. For example, the VDU system was pretty self contained, in outward appearances. In reality, there were bits and pieces accessed in a lot of places. The SWI OS_Byte calls could easily poke at parts of the code, and whilst sometimes the interfaces were clean, mostly they were just ugly.
You might expect that a VSync would be handled by the VDU system, but you'd be wrong. Rather than the IRQ system detecting a VSync and calling in to the VDU system, it performs the effects itself - determining whether the flash state has changed and acting on it, and performing special screen cache cleaning. On the other hand, it did call down to the VDU system in order to flash the cursor. Having said this, I didn't actually get to tidy that particular bit of code up much.
Spending more time with the code is one great way to spot problems, especially if you're trying to ensure that it works in the same way but in a different place - that is, extracting components.
The Kernel had a lot of things in it which didn't really need to be in the Kernel. They were handy, but they could be provided separately. This introduces a dilemma - we don't want to require components to be present in a particular order, but many things do require basic functionality. So a lot of the basic Kernel functions that were removed are still a little special in that they must exist early in the ROM. Others, though, are really irrelevant to booting - ModuleCommands doesn't need to be present early on, for example.
One of the earliest candidates for being moved out of the Kernel was the System Variable implementation. There's a bit of complexity involved in it, and isolating it into its own component made it far simpler to develop. That said, because it maintains the entire state itself, restarting it results in losing all your variables. That's another form of lobotomisation, though, so it is useful for testing things. Ideally there would be a service to say that the system variables module was going away, and another for coming back, so that code variable claimants could re-register themselves. However, as you lose a lot of other things - application and module resource locations, aliases, some configuration and your filing system context - this probably isn't all that useful.
Of course, if I had left the collusion with the Kernel in the module, the System Variable workspace would still have been hung off a zero page location that would be picked up by the new module - thus retaining the variables across a restart. The downside is that the Kernel workspace is then fixed with respect to SystemVars, and the workspace format used by the module for its variables would also be fixed, otherwise you wouldn't be able to change between versions of the module.
The extraction of the system variables meant that the search code could be made a little more efficient - because it's easier to make such changes when you're looking at just a small section of code that's entirely isolated from the rest of the Kernel. When these components were in the Kernel it wasn't always obvious how things could be improved - or what you would break by doing so.
I don't know how many of the things that are now in the OSSWIs module were originally in the Kernel - I'm pretty sure that none of them were there because they were related to some low level functions. In any case, just the names of the source files should tell you it wasn't a good organisation for the Kernel.
The same sort of thing was done for the CLIV. The CLI had been improved for Ursula by changing commands to be searched more quickly by hashing. Aliases were also sped up in Ursula - by special knowledge of how the system variables were searched. So we lost out by removing that collusion when the system variables were isolated. However, we gained some of this back because the faster system variable lookup that I had implemented in SystemVars owed a lot to the searching that had been special cased in CLIV.
Additionally, ripping the CLIV out of the Kernel meant that it no longer had access to the module list in order to enumerate the modules quickly. To address this, I introduced some new services to notify modules of other modules starting, ending, re-instantiating and being made the preferred instance. In particular, the preferred instance services are absolutely necessary when people direct their commands at instances of modules either by intent, or implicitly. Alex MacFarlane-Smith found a few bugs in that code because of his use of AMPlayer with multiple instances. He would start a new instance of AMPlayer and then direct commands at that instance. I doubt many people even think about the instantiation of modules, but it's another fun wrinkle for this stuff.
In any case, the replacement CLIV originally had the Ursula hashing method for its commands, but this was a bit fiddly, so I replaced it with an alternate C implementation. Being C it could be tested much more easily, and I could get the algorithm right - and because I was requiring that there be no collusion between components, this also meant that the test code could just use 'standard' proper calls to perform its test operations on a live system. In tests, it performed better than the original assembler version when used on a large set of operations - ie the Boot sequence. Algorithmic improvements, and greater maintainability wins out.
I considered using a list of candidate commands for hashed entries, but this would mean rebuilding the list - or at least making significant changes to a number of lists - each time a module started or exited. Getting around the problem of regenerating the list on start/exit was easy enough - we mark each module's cache as needing rebuilding, and only on the first command operation do we begin to hash things. In particular this meant that the large number of modules which started during the ROM initialisation wouldn't each trigger a rebuild, multiplying the work done for every module start up.
Rather than keeping lists of the commands, I created a fast reject bitmap for each module based on the commands that were present. This still means that a few modules may be searched wrongly, but the number searched will be significantly smaller, and the storage requirements are far less.
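The fast-reject idea can be illustrated with a small C sketch. The hash function and the 32-bucket size here are invented stand-ins, not the actual implementation: each module gets a bitmap built from its command names, and a command whose bucket bit is clear cannot be provided by that module, so its command table is never scanned. False positives (two names sharing a bucket) merely cost a redundant search; false negatives cannot occur.

```c
#include <stdint.h>
#include <ctype.h>

/* Hypothetical hash: fold the (case-insensitive) command name into one
   of 32 buckets, one per bit of the reject map. */
static unsigned bucket(const char *name)
{
    unsigned h = 0;
    for (; *name; name++)
        h = h * 31 + (unsigned)tolower((unsigned char)*name);
    return h % 32;
}

/* Build a module's reject map from its command table: set the bit for
   every command the module provides. */
uint32_t build_reject_map(const char *commands[], int count)
{
    uint32_t map = 0;
    for (int i = 0; i < count; i++)
        map |= 1u << bucket(commands[i]);
    return map;
}

/* Cheap test: might this module provide the command?  If the bit is
   clear, the full command table scan can be skipped entirely. */
int may_contain(uint32_t map, const char *name)
{
    return (map >> bucket(name)) & 1;
}
```

A single word per module is all the storage this costs, which is the trade-off mentioned above against keeping full per-hash candidate lists.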
Other modules were far more obvious as replacements - CMOS (configuration data) access was in the Kernel, which meant that we ended up having direct hardware access in the Kernel - an unreasonable state of affairs. Pace had decided to go about this problem by providing what they called a 'HAL'. Essentially this meant "lift bits out of the kernel wholesale, not changing any APIs, and just dump them in something that you can vector through". Unlike splitting things into modules with proper RISC OS APIs and the like, this had the problem that the already quite poor internal interfaces that were 'abstracted' were left in this HAL, into which other hardware then had to be shoehorned. Maybe that's a bit uncharitable, but I have the completely diametric view that hardware functionality is a modular thing, and we have a well defined, and (relatively) well implemented, system for doing so already in RISC OS. These components that provide hardware access are no different to those that come on Podules, so should not be treated as special. And, of course, many hardware things shouldn't sit 'under' the Kernel in a HAL, but should live above it.
This said, in order to know which modules need to be loaded (due to Unplug constraints) on start up, you need to have access to the configuration data. So... we need to know configuration in order to select modules for start up, but we also need to have modules in order to get at the configuration. Simple - a flag to indicate that some modules need to be initialised early. Not only does this address the issue of NVRAM configuration (and many other abstractions) but means that we can ensure that the PoduleManager is started early and remove the collusion that required it to be the second module. Other extension module providers would also have early initialisation here, were they to be present.
As well as pulling the CMOS data out of the Kernel, it was explicitly renamed to NVRAM (non-volatile RAM) in the module title - basing the name on the technology wouldn't have been too clever, I thought. The internal collusion that was previously in the Kernel, knowing how to access the IIC bus, was also pretty crazy - we have a module that does this for us, so the IIC module provides that access and we use it for our operations. The NVRAM module is actually quite simple because of this - aside from some checksums, it just calls IIC SWIs. The access interface for the NVRAM module is entirely through a new NVRAM vector.
The new vector means that any other implementation can be provided and, so long as it stores information correctly, it can be used just like the original. The newly extracted NVRAM module included caching of the configuration contents, which meant that calls would still be faster than they had been in RISC OS 3.7 and below. There is no reason why such an implementation has to be backed by hardware. It could just as easily be fixed data read at boot time, or even loaded from a file during the boot sequence (although this would lose the initial configuration and might not work for some modules initialised from a Podule).
Initialising the Operating System is slightly different, as there are modules that start before others and have to be explicitly flagged as such, but they can live anywhere in the main module chain. Obviously both NVRAM and IIC need to be early initialisation modules.
I wanted the Kernel to know nothing about hardware. Its understanding of the environment it is in should be pretty limited. There still needed to be a boot environment to start it up - providing very early initialisation like putting controllers into quiescent states, setting up memory as the Kernel will expect it, and providing facilities to control the processor. The Kernel calls down to this SystemInit environment for the very few functions it needs. The SystemInit code only provides for a few operations:
- System type identification.
- Machine reboot.
- Processor identification and operations.
- Diagnostic output.
The reset call probably isn't even necessary - it could be provided by a RISC OS module, but since you do sometimes need a way to trigger a reboot early on, it's handy to do so here.
The diagnostic output is similarly rudimentary. It can indicate a colour - the RiscPC start up uses colour during early initialisation to show the stage it has reached, but obviously the Kernel can't know how to do this before the modules have started. The video system doesn't get started for quite some time after the modules have started up. In the module list for the last build I used here, the VideoHWVIDC is module 21. The diagnostic entries can also write a character and read a character. This can be used in preference to the internal VDU output through a build time switch in the Kernel (it would eventually have become a permanent feature which would be enabled through a start up flag).
The RiscPC and QEmu SystemInit code would direct the diagnostic output to the serial port, allowing the system to start without a video or keyboard driver being present. Handy when you haven't yet implemented the video module, and helps with diagnosing how far through the initialisation the Kernel has reached. Normally the early Kernel initialisation debug would be turned off, but when it is enabled you can see a lot more of what is going on.
Of course, stripping bits out of the Kernel meant that other bugs were exposed. Without a video driver being present, the Kernel would just abort when anything was written to the screen, because it never expected there to be no screen memory. Those sorts of problems aren't too hard to fix, and fixing them also addresses the issue of restarting the Video modules - if you reinitialise a video module then there is a time when any output does not have any screen to go to. Try reinitialising the VideoHWVIDC (or whatever is in use) from within a TaskWindow. It amuses me that this is just fine - and was designed to be that way.
Timers and IRQs
Abstracting the TimerManager out of the Kernel was quite a fun process, as we needed to have it present once we started the Kernel up, but more fun than that was the IRQ system. There are many different schemes for managing IRQs, and the IOMD method was one of the simplest. I tried to make something that was sufficiently documented for the current system. Much as I dislike just extracting the implementation wholesale, without changing the way that device claims are operated by the OS there wasn't really much else that could be done. The system is stuck with the SWI OS_ClaimDeviceVector API that it has, and whilst it can be extended, we can't do too much more.
I believe that the sub-chain was exposed a bit more clearly - this is used by the PoduleManager for each of the Podules. I did provide a new service, Service_DeviceRegister, so that hardware drivers could re-register themselves when the IRQ module was restarted. However, I never got around to propagating its use through all the hardware modules. So don't try to reinitialise it. It won't work.
The Timers and IRQs were some of the last systems that were abstracted away from the Kernel. There are still areas in there that know about memory regions and set things up in a way that is expected by other modules - I never got around to dealing with them. I assume that some of that work must have been finished off.
Relationship to 32bit
The modularity created by hardware abstraction is a different problem to that of converting the OS to 32bit. However the two were often discussed together as part of the same work - they really aren't. The main reason they're discussed together is the age of the hardware at the time - the RiscPC and A7000 were both old and parts were no longer available, and the new processors that would have been able to work with new hardware would bring the need for 32bit and other hardware as well. Additionally, the new processors that were available were tending towards System On Chip solutions, where the 'hardware' being accessed was a part of the processor.
For the System On Chip argument, I can see there's good reason to bundle the two together, but it is really a false economy - bundling hardware interfaces together because the hardware happens to be on the same chip can cause greater problems in the long run. Certain RISC OS components have been handled like this in the past - particularly the Kernel (which is a big part of the abstraction problem) which handled much of the hardware, because it also managed the RISC OS interfaces.
In many cases the abstraction happened as work was needed on the APIs for 32bit, but by the time any real 32bit work was being done, a lot of the main abstractions had already been done. There were still a number of areas to be looked at which hadn't been addressed in any sensible manner yet, but those would come later.
Another important part of the RISC OS modularity in general is that interfaces are almost always fixed from the point they are released. With the exception of moving them out of the way when they clash with other implementations (q.v. the moves of interfaces which clashed with Castle, or the earlier change of the 'Z' icon validation command), interfaces have to be right early on. You cannot change an interface once it has been made available to people, otherwise the compatibility between versions goes out of the window - obviously implementations can be fixed when they're broken, but the general interface must remain the same.
This adherence to the form of the interface from early on has meant that, in many cases, applications which worked with one version of the system would work with most subsequent versions. In some cases, though, interfaces were withdrawn. I withdrew support for quite a few BBC specific interfaces whose use was quite redundant. It's quite baffling that interfaces to access the 6845 video controller used in the BBC were still supported by parts of the VDU system - before they were removed in 2006. There should be no need to retain such things so long after the hardware has been discontinued, although that could easily be said for much of the legacy BBC interfaces.
I tried to move some of the legacy interfaces into a separate module ('LegacyBBC') which would provide the emulation necessary for them to function, but using the newer interfaces (or just plain stubs). This also included some special cases which the command line processor no longer supported.
Because interfaces tended to be fixed, there was a choice during the design phase - either get the interface right first time, or make it extensible so that any failings could be addressed. Many interfaces did neither, so resulted in 'magic' value extensions being used to flag the extended interfaces in later versions. The many extensions to the SWI calls for the WindowManager are a typical example - these generally used 'TASK' in a register to indicate that the extended API was in use.
Later modules were designed to allow for such extensions by their interfaces using a flags value in R0. This allowed subsequent implementations to use extensions that hadn't been decided during the design. I adopted this for many modules in order to reduce the likelihood of future problems. Another way to provide for extensions was to use a reason code in a register, which allowed for extended operations, but mostly this applied to operations for which there were going to be variants - eg Register, Deregister, Read state or similar. The two methods could be combined, providing a reason code and a flags word in a single register. Modules like ImageFileRender used this method to select the type of render being performed.
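The combined reason-code and flags convention might look like this as a C sketch (the operation names and the bit split are illustrative assumptions, not the actual ImageFileRender layout): the low byte of the register selects the operation, the remaining bits are flags reserved for future extension, and anything not understood is faulted.

```c
#include <stdint.h>

/* Hypothetical operations selected by the low byte of R0. */
enum { OP_REGISTER = 0, OP_DEREGISTER = 1, OP_READSTATE = 2 };

#define REASON(r0)   ((r0) & 0xFFu)      /* low byte: reason code */
#define FLAGS(r0)    ((r0) & ~0xFFu)     /* remaining bits: flags */
#define KNOWN_FLAGS  (0u)                /* none defined in this version */

typedef enum { OK, BAD_REASON, BAD_FLAGS } result;

/* Dispatch a call: fault unknown flags and unknown reasons, so that a
   future version can safely assign meanings to them. */
result dispatch(uint32_t r0)
{
    if (FLAGS(r0) & ~KNOWN_FLAGS)
        return BAD_FLAGS;                /* fault what we don't understand */
    switch (REASON(r0)) {
    case OP_REGISTER:
    case OP_DEREGISTER:
    case OP_READSTATE:
        return OK;
    default:
        return BAD_REASON;
    }
}
```

Rejecting unknown bits from day one is what makes the interface extensible: a later version can define a flag knowing that older implementations returned an error rather than silently ignoring it.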
Although the flags were useful as a means to distinguish different functions, this presumed (generally) that any implementation would fault flags that it didn't understand. Without this, it still wasn't safe to extend the interface by setting different flags. However, faulting unknown flags prevented variants of the interface from being provided in a backwards compatible manner without multiple calls. For example, you couldn't provide a flag which gave a hint about the way in which the operation should be performed if the earlier version rejected unknown flags. Instead, you would have to call the API twice, once with the hint set, and then a second time without it if the first call failed - which probably negated any benefit you might see from the hint.
In a couple of cases, I wanted to follow the PNG principle of critical/ancillary feature flags. The 'critical' flags (say bits 0-15) would indicate those variants of the call which were not backwards compatible, where as the 'ancillary' flags (say bits 16-31) would indicate those variants that could be ignored for the purposes of compatibility. I don't think I ever used this interface, but I did suggest it a couple of times in API designs. The fact that ID3v2 used a similar form of flag for its frames probably had a strong influence here.
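The critical/ancillary split can be sketched in C (the masks and the single defined critical bit are invented for illustration): unknown critical bits mean an incompatible variant and must be rejected, while unknown ancillary bits are mere hints and can be safely ignored - which is exactly what allows a hint flag to be added without breaking older implementations.

```c
#include <stdint.h>

/* Hypothetical split: bits 0-15 are critical (must be understood),
   bits 16-31 are ancillary hints (safe to ignore if unknown). */
#define CRITICAL_MASK   0x0000FFFFu
#define KNOWN_CRITICAL  0x00000001u      /* example: one defined variant */

#define ACCEPT  1
#define REJECT  0

/* Accept a call unless it uses a critical bit this version does not
   implement; unknown ancillary bits pass through untouched. */
int check_flags(uint32_t flags)
{
    uint32_t unknown_critical = (flags & CRITICAL_MASK) & ~KNOWN_CRITICAL;
    if (unknown_critical)
        return REJECT;     /* an incompatible variant we cannot honour */
    return ACCEPT;         /* unknown ancillary bits are just hints */
}
```

This mirrors the PNG chunk convention mentioned above: the caller can set a hint bit speculatively, knowing an older implementation will simply not act on it rather than error, avoiding the call-twice dance described earlier.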
I did actually break the rule about APIs remaining fixed when it came to AMPlayer. I proposed changing all the SWI calls such that a flags word was introduced at the start. Although this allowed for variants of the SWIs, which was a generally useful idea, it also allowed every SWI call to have a common flag to indicate that the call should be directed at a specific instance of the module. This made it very easy to control different instances without having to manually switch them. There was a little discussion between myself, Thomas Olsson and Robin Watts about this interface before I managed to convince them that this was a Good Thing.
I still think it was good and it made AMPlayer far more flexible. A few manager applications broke because of the change, but they were fixed pretty quickly. It is only because those applications were still in active development and the developers appreciated the interface (well, I hope they did) that they were updated though. I don't believe that any other interfaces were changed in such a radical way.