AppPatcher

The AppPatcher module has been present since, I think, RISC OS 3.7. It was introduced to patch code which would not function on the latest version of the operating system, by modifying the code as it loaded. The StrongARM notes describe its function and the manner in which code must be structured in order to be patched through the Service_UKCompression entry point.

Other modules (such as the UnSqueezeAIF module) performed any decompression for the compressed code such that the AppPatcher could perform the patching as required. If the decompression module cannot do its job - either because the compression is not known, or the format has been subverted - the AppPatcher won't be able to do its job, thus preventing the code from being checked or patched for anything that cannot be handled, and preventing forward compatibility.

For applications, the code to decompress the applications was wrapped around the application code. It would normally just be executed as it stood on pre-RISC OS 3.7 systems.

However, because the code used to decompress was self-modifying, it wouldn't work on the StrongARM (which has separate instruction and data caches). The solution to this - making the code work without updates being required from developers - was to have a module (the UnSqueezeAIF module or others) perform the decompression in a safe way. Acorn did this to ensure continuity of the existing applications; without it, users would likely have been frustrated when their existing programs stopped working. The decompression gave the AppPatcher the opportunity to do its job and fix any other issues which had been found.

It was a particular frustration that this interface had been defined for RISC OS 3.7 and was working fine until Castle released a compressor which was not understood by the UnSqueezeAIF module. Someone - well meaning, I'm sure - had noticed that the decompression code produced by the squeezer wasn't safe for running on the StrongARM because it didn't call SWI OS_SynchroniseCodeAreas in order to ensure the modified code was seen by the processor. So they added one. They clearly didn't realise that this interface was required for the patching mechanism to work. The only thing that making the change did was to break the patching mechanism - on a RISC OS 3.7 (or later) system the decompression code already worked, because of the UnSqueezeAIF module.

I contacted Castle about this, and informed them that I had updated the UnSqueezeAIF module to recognise the new signature so that it could decompress the code and thus patch things properly. Like other times I had attempted to contact them about issues, I got no response. That is, until their next release of their tools when the code appeared to have been changed to no longer match the signature that was there. This was already after the Kernel/GPL debacle, so I gave up trying to do anything that might work with them. If they wanted to actively break things after having been informed of workarounds, then that was their choice. Sadly that meant that users ended up with things not working - there was very little I could do if they were going out of their way to make things not work together, without any communication or justification.

Anyhow... There's a lot of code, both third party and core code, which uses path variables wrongly. It seems that even the original versions of the Internet libraries would use file specifications like '<InetDBase$Path>Hosts', rather than correctly using a path alias like 'InetDBase:Hosts'. This sort of misuse would work only if the path variable contained only a single entry - which negated the purpose of using a path variable in the first place.

The AppPatcher would recognise these and convert the incorrect path uses into the correct form so that they would work.
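As an illustration of the kind of rewrite involved, here is a minimal sketch in Python. It is not the AppPatcher's actual code: the function names are made up for the example, and it assumes the string is NUL-terminated inside the image. The point it shows is that the corrected form is always shorter than the incorrect one, so it can be written over the original and padded with NULs without moving anything else in the image.

```python
import re

# Matches '<Name$Path>' (the misuse), capturing the variable's base name.
_PATH_VAR = re.compile(rb"<([A-Za-z0-9_]+)\$Path>")

def fix_path_reference(spec: bytes) -> bytes:
    """Rewrite '<Var$Path>leaf' as 'Var:leaf'; leave anything else alone."""
    return _PATH_VAR.sub(rb"\1:", spec)

def patch_in_place(image: bytearray, offset: int) -> bool:
    """Patch one NUL-terminated string in an image (hypothetical helper).

    The replacement is padded with NULs to the original length, so the
    image size and the position of everything after it are unchanged.
    """
    end = image.index(0, offset)          # find the terminating NUL
    old = bytes(image[offset:end])
    new = fix_path_reference(old)
    if new == old:
        return False                      # nothing to fix
    image[offset:end] = new.ljust(len(old), b"\0")
    return True
```

With the example from above, fix_path_reference(b"<InetDBase$Path>Hosts") gives b"InetDBase:Hosts", which the filing system will then resolve against every entry in the path rather than only working when there is a single one.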

I also updated the AppPatcher to operate on modules, through the services which informed clients of new modules starting. This meant that any of the libraries, or common code patterns, which had been used in modules could be updated so that they were safe.

The module socket library had a minor bug that meant that it could corrupt the 'errno' returned by the call. A small patch to the error block return allowed it to return the correct error number. This seemed to affect the HTTPStream module supplied with Oregano.

In the Kernel there had been notes about R9 being preserved over SWI calls especially for ImpressionSpell. As this was an unnecessary restriction for an old module, and at variance with the documentation, I removed the special preservation of R9 from the Kernel SWI dispatch. The AppPatcher was updated to recognise ImpressionSpell and patch its SWI entry sequence to correctly preserve the registers. ArcMemMan also suffered from this problem and was also added to the patch list.

Neither of these patches was 'necessary' on earlier systems because the Kernel would explicitly preserve the register, but once they were fixed the Kernel could lose the special case.

At one point it became necessary to move around some bits of the Kernel workspace - mainly because hardware support was changing and it became easier to manage some things without retaining the legacy allocations which we no longer used. In particular, the old vestiges of early hardware were stripped from the Kernel where they made things cluttered. One of the many things that was removed was the flag for MEMC1a. Whilst the Kernel didn't need it, <sigh> some versions of EtherH appeared to rely on it. At least version 4.33 would explicitly check for it by probing the memory directly, and would find that the wrong data was there, causing it to fail to receive anything from the network.

One solution would be to fix address &112, which contained the flag, to always contain the value that was expected. This wasn't really reasonable given how old the component was. Additionally, version 4.52 did the right thing, checking the system type first. The fix that I created was a new entry in the patcher to replace the load of the flag with a fixed value, so that it would always work on the 'modern' hardware, regardless of the content of the private area.

When I began to update the SharedCLibrary to handle 32bit operations, I needed to compare its behaviour with that of the Castle produced version. There were a few problems with their version. In addition to a few inconsistencies with the previous specifications, which weren't really a big issue, it also violated APCS in some circumstances, which could cause problems for users, instability and - in some circumstances - system crashes. Despite previous problems, I sent an email to let them know that there was an issue, and again received no reply. The behaviour could not practically be fixed by on-the-fly patching (I realise it could have been fixed by patching, but the amount of work necessary was prohibitive, and given the previous actions to actively break the patches, I decided not to pursue that course). Instead, I chose to report an error in the desktop (or to the regular output if we were not in the desktop) but not prevent the module from loading.

In a previous change I'd added code to prevent the Castle SharedCLibrary from loading directly, in order to prevent problems with applications, but this merely caused users more frustration and was reversed in the next release. A mistaken decision, I guess. In general it tended to be that warnings about there being problems were treated as 'FUD' by a few vocal individuals.

Whilst I attempted to match Castle APIs and releases where they were documented, investigate issues where they were not, and to try to highlight issues, I'm pretty certain that I saw no times when they considered that anything that I did was really worthy of investigation or being pursued. Consequently, as communication was almost entirely one way, I was regularly having to move APIs that had been produced when it was found that they had used them for other purposes.

There were exceptions. Early on there were a few exchanges which resulted in both the implementations of the FreeSpace64 entry point for ShareFS being consistent on both versions. Later, we gave them a complete copy of the Toolbox stack so that the two could begin to converge. I never heard anything more of that, so I'm guessing it was pretty much just lip service to any sort of working together.

Execution formats

No one area has provoked such a bad reaction as the changes to the formats of executables. Or more specifically, the conclusion of a 10-year notice period that prior formats were no longer going to be supported.

Back when the StrongARM support was added - in RISC OS 3.7 - there were a number of support notes issued. They detailed the ways in which support had been provided for the processor, and how it could continue to be supported. They covered the AppPatcher, the need to avoid certain instruction sequences, the deprecation of self-modifying code and introduction of SWI OS_SynchroniseCodeAreas for those who needed it, some notes about the future direction of the system, and a few other things. One of the things that came out of the need to be able to safely ensure that an executable could run (and provide a safe environment for it to do so if necessary) was that the executables would require headers to be present to describe their use.

These headers were required by the Unknown Compression/Patcher interface in order to function. As work progressed towards 32bit it was obvious that running code which expected to be executed in 26bit mode in a 32bit mode would be fatal - most likely in a privileged mode, as that's where the most danger lay for the execution formats and where there was the greatest likelihood of failures.

To me, executing an application without at least validating that it was of a suitable format is a little bit of insanity. Within programs it is required that APIs validate their arguments and to ensure that they are not doing anything they ought not to be. Data formats (generally) have signatures and checks within them to ensure integrity. To not have the same checks on the code you intend to execute is ... insanity. There's really no other way to put it.

The behaviour had been retained previously because that's the way we roll around here. There were obviously applications that hadn't been upgraded by people and that was to be expected.

One of the complaints (or jibes, depending on who it came from) about all versions of RISC OS was the fact that it was easy to cause it to break. In some cases it is unreasonable to expect otherwise from its design, but that is no reason to retain inherently bad interfaces. Throughout the work on RISC OS, I had tried to ensure some degree of validation on APIs and parameters, so that there weren't bad consequences of passing invalid data to functions and APIs. There are many examples of where this was done, and these principles were extended to the executable format.

One problem with code executed in 32bit mode, rather than 26bit mode, is that instructions take on a different meaning when executed within a privileged mode. This can be coupled with the basic problem with the way that USR mode applications set up their environment (and it's completely correct to say that part of the problem is that the application controls the environment, not the Operating System).

During the application start it will register environment handlers, which will handle exceptions, messages and a few other things. These exception handlers will be entered in a privileged mode. Assuming that the application is not 32bit - or doesn't know about the shape of the system that it's now running - it's going to very quickly run into trouble, probably causing an abort. The abort handler is entered in privileged SVC mode and begins to do its job. Now it also doesn't know about 32bit mode, so it rolls on and uses the changed instructions - TEQP, or an LDMFD rx, {...pc}^ - which then do different things to what they expected.

In the former case, the processor status might be unaffected when the application expected to change interrupt status, or change to USR mode. In the latter case, the saved processor state is restored - as you've probably just been called to handle an exception, or the exit, that means you're probably going to restore back to USR mode, with the USR mode stack, but you're still executing code as if you expect to be in SVC mode. If you'd been unlucky and had an FIQ or IRQ in the time between your entry and the instruction, then you'd end up back in SVC mode (as that was the saved state) - which is lucky really as that's what you expected. If you'd ended up in USR mode, though, you'd probably find that you aborted pretty sharpish, thus repeating the cycle and leaving you with a stiffed machine.

This and other pitfalls (I've only described one possible case - there are many others) mean it is pretty important to be sure what type of code you are running.
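To make the two instruction forms mentioned above concrete, the sketch below scans a block of little-endian ARM code for words that decode as TEQP (a TEQ with the S bit set and Rd of 15) or as a block load with the S bit set and PC in the register list. The masks come from the standard ARM instruction encodings; the function name is made up for the example, and this is only a heuristic, since literal data embedded in the code can match the same bit patterns. It is not a description of how any RISC OS component actually checked code.

```python
import struct

# Data-processing TEQ (opcode 1001) with S=1 and Rd=15: the 26-bit "TEQP" form.
TEQP_MASK, TEQP_VALUE = 0x0DF0F000, 0x0130F000
# Block data transfer, load, S bit set, PC in the list: LDM rx, {..., pc}^
LDM_PC_MASK, LDM_PC_VALUE = 0x0E508000, 0x08508000

def find_mode_sensitive(code: bytes):
    """Yield (offset, description) for words that look like TEQP or LDM {..pc}^.

    Hypothetical helper: a pattern match only, so data words may be
    reported as instructions.
    """
    for offset in range(0, len(code) - 3, 4):
        (word,) = struct.unpack_from("<I", code, offset)
        if (word & TEQP_MASK) == TEQP_VALUE:
            yield offset, "TEQP"
        elif (word & LDM_PC_MASK) == LDM_PC_VALUE:
            yield offset, "LDM {..pc}^"
```

For example, feeding it the three words &E1A00000 (MOV R0,R0), &E33FF000 (TEQP PC,#0) and &E8FD8000 (LDMFD R13!,{PC}^) reports the second and third and ignores the first.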

Another problem that happens when you don't validate the executable that you're about to run is that you don't know how much space is needed for it, and many of the executables don't check properly. Buggy checks in the decompression code (when present) mean that it is possible to stiff the machine if the application space isn't large enough to accommodate it. The AppPatcher tries to take care of this when it can understand the format, but if it finds that it cannot do its job (because the compression format was unrecognised, for example) there's little it can do.

One of the problems with so many RISC OS interfaces was that they had to work within constraints which were created by code that didn't conform to the APCS, or even a known application state. This meant that executing code in USR mode wasn't an option in general. This constraint remained partially because of BASIC's conformance to its own standard - and there were a lot of BASIC programmers, and programs. Another part of the problem came from assembler code which had its own rules - there wasn't even the basic assumption that the stack lived in R13. APCS-A applications were still supported, as well, which were in that class of not using R13 as stack.

The SWI OS_ChangeEnvironment interface has already been discussed, and the problems it caused were not going away any time soon - it would eventually be removed but that wasn't going to happen at this point. With the changes to the SharedCLibrary (see another ramble on that), the support for APCS-A was removed, so that part of the problem went away, but the rest remained.

So essentially there were multiple problems caused by unconstrained code being executed, whether on 32bit or 26bit systems, but made worse in 32bit systems. And there was the fact that this change had been announced 10 years previously.

The AIF module was created not to enforce this specifically, but to move the implementation of the Absolute type out of FileSwitch (thus allowing it to be replaced if necessary), and allow it to be clearly vectored through FSControlV. The vectoring, and the definition of the API, meant that it was significantly easier to replace, or augment, if necessary. The AIF module would, itself, issue the necessary services for the compression modules and patcher to decompress and run the program. If the file format was not recognised - that is, the absolute was non-AIF - a separate service would be issued to allow other handlers to take over. And if the file was 26bit but the system was executing in 32bit mode, a different service would allow other clients to handle that. Systems like a 26bit-on-32bit emulator might handle those types of files.
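The decision described there can be sketched roughly as follows. This is an illustration, not the AIF module's code. It assumes the published AIF header layout as I understand it: a 128-byte header, SWI OS_Exit (&EF000011) in the word at offset &10, and the address mode (26 or 32, with 0 meaning an old-style 26bit image) in the low byte of the word at offset &30. The function names and the returned strings are invented for the example, and it ignores the decompression step that would normally come first.

```python
import struct

AIF_HEADER_SIZE = 0x80
SWI_OS_EXIT = 0xEF000011      # assumed marker word at offset &10

def classify_absolute(image: bytes) -> str:
    """Return 'non-AIF', 'AIF-26' or 'AIF-32' for an Absolute file's contents."""
    if len(image) < AIF_HEADER_SIZE:
        return "non-AIF"
    words = struct.unpack_from("<32I", image, 0)
    if words[4] != SWI_OS_EXIT:           # offset &10
        return "non-AIF"
    address_mode = words[12] & 0xFF       # offset &30, low byte
    if address_mode == 32:
        return "AIF-32"
    if address_mode in (0, 26):
        return "AIF-26"
    return "non-AIF"                      # unknown mode: treat as unvalidated

def dispatch(image: bytes, system_is_32bit: bool) -> str:
    """Pick which route an image takes (labels are illustrative only)."""
    kind = classify_absolute(image)
    if kind == "non-AIF":
        return "offer to non-AIF handlers"
    if kind == "AIF-26" and system_is_32bit:
        return "offer to 26bit handlers"
    return "decompress, patch and run"
```

The useful property is that each rejection leads to a distinct route that another module can claim, rather than to the code simply being run and left to fail.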

Additionally, debugger services would be issued to check if an active debugger wanted to take control of any debug areas that were present. If none responded, then it was safe to ignore the debug areas - saving loading time.

FileSwitch would also issue new SWI OS_FSControl calls for running Untyped files (those with load and exec addresses), as these would also be unvalidated. These could be handled by just claiming the vector.

Utilities also got their own SWI OS_FSControl call, as they were another executable format which wasn't validated. The TransientUtility module handled these. Unlike the Absolute files, no format had ever been declared for these, so whilst it was less safe, the format was not mandatory except on 32bit systems.

Module flags had been defined in RISC OS 4 in order to support service call tables for fast reject selection. Pace later defined the format of the 32bit flag in this module word, allowing the Kernel to reject modules it could not understand. Similarly, Podules had a table added to their header which declared that they were 32bit safe ('32OK' magic word!)
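A rough sketch of the module check follows, under assumptions that should be verified against the module header documentation: that the header entry at offset &30 holds the offset of the flags word, and that bit 0 of that word marks the module as 32bit compatible. The function name is made up, and a real implementation also has to decide whether an older module's header is long enough to contain the entry at all, which this sketch only approximates with range checks.

```python
import struct

FLAGS_OFFSET_ENTRY = 0x30     # assumed: header entry giving the flags word offset
FLAG_32BIT_OK = 1 << 0        # assumed: bit 0 means "32bit compatible"

def module_is_32bit_safe(module: bytes) -> bool:
    """Return True only if the module declares itself 32bit compatible."""
    if len(module) < FLAGS_OFFSET_ENTRY + 4:
        return False                      # header too short to have flags
    (flags_at,) = struct.unpack_from("<I", module, FLAGS_OFFSET_ENTRY)
    # No flags word, a misaligned offset, or one outside the module:
    # treat the module as not declaring 32bit support.
    if flags_at == 0 or flags_at % 4 or flags_at + 4 > len(module):
        return False
    (flags,) = struct.unpack_from("<I", module, flags_at)
    return bool(flags & FLAG_32BIT_OK)
```

Anything that fails the check is treated as 26bit only, which matches the default-deny approach taken for the other executable formats.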

All the changes were quite flexible and almost entirely modular, which allowed other clients to either replace, or augment, the functionality. Completely in keeping with the way that RISC OS has always been developed, I believe.

The AIF changes, though, got a reasonable amount of criticism. Initially I had intended to not support the ability to run non-AIF code. After all, to get around this, you merely had to install a module which handled the non-AIF code, and this was pretty simple to do. There were already configuration options available to relax the checks on the executables, such as the checks on the length of the code being correct and the like. I was persuaded to include the option to allow the execution of bare executables, but refused to create any sort of tool for !Configure to control it on the grounds that ... it's been 10 freakin' years, people - get your act together and at least try to catch up with the fact that Operating Systems move on. It's hardly groundbreaking stuff. The number of people wanting to move on to 32bit but wanting everything to stay exactly the same completely baffled me. Whilst there's an acceptance of a reduced development base, if you're going to move forward you really need to move forward. Developing new things on poor foundations is just setting yourself up for a disastrous fall, and this was a perfect example of where things needed to change.

So in summary, I completely stand by my decision to change the behaviour of executables to protect the system, and provide better control in the future. Anyone who disagrees clearly has a different view on the manner in which RISC OS development should have progressed and that's fine. They're wrong, though (in the nicest possible way).