Memory: For the video system | RISC OS Rambles

Support for the video system

Cached screen

In RISC OS 4 another improvement to the speed of the system was to make the screen cacheable. This meant that a lot of operations that read and write to the screen would work faster. The difference was noticeable for certain operations, but had the downside that because the data was in the cache and not flushed to the screen memory before a VSync occurred, parts of the updated screen might not appear for a moment or two. Depending on what the machine was doing, the data might not be written from the cache to the screen until quite a bit later. This problem was reduced by flushing the cache on every VSync (or every other VSync, or every 3rd VSync, depending on configuration) if the screen had been written to since the last flush.

Knowing that the screen had been written to was handled through the abort system. ARM processors, when used with an MMU, have a feature where a page entry can be associated with a 'domain'. The domains can be controlled independently of the page table entries, allowing them to be switched en masse. Each domain can be in one of 3 states - 'no access' (ignore the page table entry and always fault), 'client' (obey the page table entry), and 'manager' (obey the page table entry except for the permission bits, allowing both read and write access to the page).

Because the domain 'table' is a single register in the processor, and doesn't need to be synchronised, this means that large blocks of pages - even sparsely distributed across the memory map - can be made accessible (or inaccessible) with a single processor register write. Most page table entries will be domain 0 - because the domain bits are unset. The screen, however, is configured as domain 1 by the Kernel under RISC OS 4. The screen can be marked as entirely aborting by just changing domain 1 to be 'no access'.
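As a rough illustration of why one register write is enough, here is a toy Python model of the domain access control register. The layout (16 domains, 2 bits each, with 00 for 'no access', 01 for 'client' and 11 for 'manager') follows the ARM architecture; the function names and the starting register value are made up for this sketch, and real code would write the register through the system control coprocessor from a privileged mode.

```python
# Toy model of the ARM domain access control register:
# 16 domains, 2 bits each, packed into one 32-bit value.
NO_ACCESS = 0b00   # every access faults; page table permissions ignored
CLIENT = 0b01      # page table permissions are checked
MANAGER = 0b11     # permissions are not checked; reads and writes allowed


def set_domain(dacr, domain, state):
    """Return a new register value with one domain's 2-bit field replaced."""
    if not 0 <= domain <= 15:
        raise ValueError("domain must be 0-15")
    if state not in (NO_ACCESS, CLIENT, MANAGER):
        raise ValueError("invalid domain state")
    shift = domain * 2
    cleared = dacr & ~(0b11 << shift)
    return (cleared | (state << shift)) & 0xFFFFFFFF


def get_domain(dacr, domain):
    """Extract the 2-bit state for one domain."""
    return (dacr >> (domain * 2)) & 0b11


# Hypothetical starting point: domains 0 and 1 both in 'client' state.
dacr = set_domain(set_domain(0, 0, CLIENT), 1, CLIENT)

# One write makes every page tagged as domain 1 (the screen) fault,
# without touching a single page table entry.
dacr_trapping = set_domain(dacr, 1, NO_ACCESS)
```

Everything in domain 0 is unaffected by the second write, which is the property the screen trapping relies on.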

This is what the system does for the screen, and then on the next screen access the system will abort and know that the screen was accessed (because it's a domain abort). It sets a flag to say that the screen was accessed and will therefore need a cache clean, marks the domain as accessible and re-executes the instruction. In fact, RISC OS never used to check which domain had raised the fault, so if you used any other domains on the system, generating a fault would cause the system to hang, repeatedly re-executing the same failing instruction whilst uselessly re-enabling domain 1. That was fixed, obviously!

Every VSync (or as configured), the flag that marked whether the screen had been accessed would be checked and if the flag was set, the cache would be cleaned, causing all the pending data to be flushed out. This was actually the fallback solution used by the system when running on pre-Phoebe hardware - Phoebe would have had a memory mapped register in video space that could be checked for video memory accesses, thus avoiding the need to trap aborts in the area.
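The interplay between the abort and the VSync check can be modelled in a few lines. This is only a sketch of the behaviour described above, not the Kernel's code: the class and attribute names are invented, and closing the domain again after each clean is an assumption about how the trap gets re-armed.

```python
class CachedScreenModel:
    """Toy model of the abort-driven 'screen was touched' tracking."""

    def __init__(self, flush_every=1):
        self.flush_every = flush_every  # clean on every Nth VSync
        self.domain_open = False        # False: screen domain is 'no access'
        self.dirty = False              # set by the domain abort handler
        self.vsyncs = 0
        self.aborts = 0
        self.cleans = 0

    def access_screen(self):
        """Any read or write of screen memory."""
        if not self.domain_open:
            # Domain fault: note the access, reopen the domain and
            # let the instruction be re-executed.
            self.aborts += 1
            self.dirty = True
            self.domain_open = True
        # The access itself then proceeds at full (cached) speed.

    def vsync(self):
        self.vsyncs += 1
        if self.vsyncs % self.flush_every:
            return                      # not a flushing VSync
        if self.dirty:
            self.cleans += 1            # stand-in for the cache clean
            self.dirty = False
            self.domain_open = False    # assumed: re-arm the trap
```

In this model a thousand screen writes between two flushes cost one abort and one cache clean, and an idle screen costs nothing at all.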

This method of handling the cached screen was embedded into the Kernel's handling and whilst it was fast, it was completely unusable as an abstracted system. Not only that, but the screen memory - the VRAM that the screen used - was actually treated like part of regular memory. It was able to be used as regular memory if needed. If non-VRAM memory was available the system would occasionally 'rescue' the VRAM, replacing it with regular DRAM. The reason for this was that the VRAM on Phoebe would be quite a bit slower than regular memory, so using it in that way would be bad for performance.

Additionally, the screen memory's Dynamic Area was actually managed by the Kernel and allocated explicit pages from VRAM (on a 1MB boundary when the cached screen was enabled). All of which was just not going to work as the video system became abstracted.

The way that Pace dealt with the screen was to just lift the code that handled initialisation and manipulation, and put it behind the HAL interfaces. This was absolutely the wrong way to go for RISC OS as a whole (in my opinion) - it is quite possible that their direction after that decision improved, but it has been many years and I haven't been following.

In RISC OS, interfaces are provided through modules, which provide vector handlers and SWI interfaces that form their well defined and isolated APIs. The video system is not special, and should not be treated as such - it is a device like any other. My view of RISC OS was that it remained a modular system, and one in which each component could be viewed in isolation, conforming to well known APIs which allowed them to be replaced or augmented, as has always been the case with RISC OS components.

To do this, the video system needed to be abstracted properly, with proper vector calls. Allowing multiple video devices to be present. Allowing the drivers to offer accelerated operations where they could provide it. Allowing fallback to existing software operations if they did not. Allowing extension for features that we hadn't yet defined but were going to be very excited about in the future. Essentially the Kernel had nothing to do with any of these things - it shouldn't care.

Memory detection

So... the first thing to change was that the screen memory needed to go. Normally the Kernel would detect the screen memory on start up, just like regular memory, and it would be added to the pool of pages that could be allocated (at the end, so that it should only be allocated last, but it would still be available if needed). This was removed - VRAM would only be used for video. This simplified the system by removing the need to handle VRAM separately, and meant that the video system was completely responsible for its memory. It also meant less memory for users who had slightly smaller screen modes, but this shouldn't have been a significant issue as the size of the VRAM is limited to 1MB or 2MB - not a huge loss. It is a bigger issue for A7000 users, where there isn't any VRAM, and the video memory came from regular RAM - so 1MB would be dedicated to the video system. However, in order to keep a sane system I felt that this was entirely acceptable.

This meant that the video system (VideoHWVIDC, in this case) needed to do the memory detection for the VRAM, as the Kernel (or System Initialisation code) wouldn't be doing this for it. It is not a particularly complex operation, especially as the code already existed. In non-VIDC hardware drivers this would involve other hardware specific checks - probably just reading a hardware register that gave the video memory size, which would be a lot simpler!

Dynamic area ownership

Next, the screen dynamic area needed to move - the Kernel could no longer control 'The' Screen Dynamic Area. As discussed in the Graphics ramble, the screen dynamic area was moved to be handled by the module, which reduced the number of system dynamic areas. And the 'cursor' dynamic area (which wasn't actually a dynamic area but a section of other areas such as the heap and stacks) was moved to the video module.

DMA area

This is where things get a bit hairy - and I have simplified the sequence, because some of the intermediate versions of the Kernel and graphics drivers were partial implementations split between the Kernel and module. The cursor area needs to be just some regular memory that holds the shape of the pointer, which VIDC can get at. Normally you would just allocate a block of memory with SWI OS_DynamicArea for just one page (as that is all it needs, if I remember right). The problem is that you need that memory to be able to be used for DMA by VIDC. All 'normal' memory is. On Stealth (the StrongARM plus memory card, which Castle produced) this isn't the case - the memory on the card isn't accessible by VIDC.

Prior to the removal of the cursor area from the Kernel's control, the memory had been dedicated to it from the on board memory as part of the allocation during the operating system initialisation. When the work had been done on the operating system for the Stealth card, a new flag had been introduced for dynamic areas - 'DMAable'. This would only use the memory which was able to be used for DMA as a candidate for allocation. This flag was used to ensure that the video area's cursor allocation worked.

Physical areas

The screen area itself needed to be mapped to the physical VRAM, once it had been found. It has always been possible to request specific pages of memory be mapped to a dynamic area through the dynamic area handler 'PreGrow' entry point. However, the pages allocated in this way must come from the real memory - they cannot be arbitrary physical addresses. As the VRAM had been removed from real memory, this meant that such remapping was not possible.

Pace used a different way of mapping physical pages, using a pool of anonymous logical address space, accessed through SWI OS_Memory. I didn't feel that I wanted to commit to an anonymous allocation scheme. Everything about the way that components were developed for RISC OS focused on having control over the system and improving the accountability (as explained in some of the rambles on modularity), and their interface was completely counter to that.

The control that dynamic areas gave, plus the extant logical address space search, meant that there was already a perfectly good way to allocate memory regions which could be reused sensibly for physical areas. Using the SWI OS_DynamicArea calls and interfaces meant that the regions could be named, which aided the accountability, and could be controlled by their owners without being affected by others. Reusing the Dynamic Areas also meant that SWI OS_Memory and SWI OS_ValidateAddress operations could be handled appropriately for those ranges (there are arguments that they should not be handled in the same way, but I fell on the side of reusing the interfaces where they seemed appropriate).

It fitted with my general scheme for how RISC OS memory was managed - all areas were attributable to Dynamic Areas of some form; all memory could be managed through the SWIs that manipulate them; modules manage their own memory areas; there are no anonymous regions of memory. Not all of these were completed - there were still some anonymous regions, but many of the previously unlabelled regions were given areas to ensure that they fit the scheme.

Abortable areas

Anyhow, the screen area became a physical area, managed by the VideoHWVIDC module. The special screen cleaning behaviour couldn't be used in this way though - the screen was uncached. This reduced its efficiency over that of the older implementation, so I wanted to get it back. Additionally, marking the logical screen memory as abortable would be vital for any accelerated interface. Accelerated hardware (at least the hardware I have seen) takes one of two forms. Either there are register operations where you give the controller the parameters you want and then issue a command to a register, and the operation happens, stalling until it is complete. Or you write the operations into hardware memory (or registers) and then issue a command, and the operation is placed into a queue to be processed when the hardware becomes idle.

This latter form of pipeline is more common, and - depending on the hardware - any direct reads or writes to the video memory which happen whilst this queue is being processed might take effect before the queued operation. Or they might be interleaved with the operations of the queued command. It is a classic race condition, which is bad - we need our Direct Screen Access to work even if there are accelerated operations. The way to do this is to reuse the aborting system discussed previously. Once an accelerated operation is in progress (that is, in the queue), the screen area can be marked as aborting, and if any operation to read or write the screen is made, we wait until the pipeline becomes empty. Then we re-enable the access to the screen and perform the operation. The application (or module) is completely oblivious that it is being interleaved with accelerated operations, except that maybe the simple load or store took longer than it might expect, and the order is again deterministic.
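A small model makes the ordering guarantee easier to see. The sketch below is purely illustrative: the queue of fill operations stands in for whatever a real accelerator would offer, and all of the names are hypothetical.

```python
from collections import deque


class AcceleratedScreen:
    """Toy model: direct screen access is trapped while the
    accelerator still has queued operations."""

    def __init__(self, size):
        self.memory = [0] * size
        self.queue = deque()       # pending accelerated operations
        self.trapping = False      # True: screen area aborts on access

    def queue_fill(self, start, end, value):
        """Queue a hardware fill and arm the abort trap."""
        self.queue.append((start, end, value))
        self.trapping = True

    def _drain(self):
        # Stand-in for waiting until the hardware pipeline is empty.
        while self.queue:
            start, end, value = self.queue.popleft()
            for i in range(start, end):
                self.memory[i] = value

    def _on_access(self):
        if self.trapping:          # this is the abort handler's job
            self._drain()
            self.trapping = False  # re-enable direct access

    def write(self, index, value):
        self._on_access()
        self.memory[index] = value

    def read(self, index):
        self._on_access()
        return self.memory[index]
```

A direct write issued after a queued fill always lands on top of the fill, never underneath it, and the caller does nothing special to get that behaviour.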

And so, we needed a way to make the dynamic area able to handle its own aborts. Part of the abort handling code for generalised abort trapping was already present in RISC OS 4. It didn't work in all cases, nor was it complete, but the SWI and the handling of aborting instructions were updated so that they did work as expected for the majority of instructions - I think I missed SWP and the coprocessor operations (although they are less important for the cases I initially intended to handle). It didn't support any Thumb, and the half-word operations weren't available (reliably) on the processors we supported, so their support was deferred until later.

This work had been done previously and tested for a couple of things. In one example, I had written a clone of a memory mapped hardware MP3 decoder chip such that you could write to memory just like the MP3 decoder chip and the result would come out of AMPlayer. There were also a couple of examples of virtual memory that I wrote, using sparse dynamic areas and abortable areas to map in files on demand. It was amusing to see a large JPEG being rendered very slowly because the file data was only ever in memory in 4K chunks. It was also useful; handling the single page is a worst case, so exercised more of the code.
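The single-page worst case is easy to model. The following sketch is not the original test code; it simply shows, with invented names, how keeping one 4K page resident turns every page crossing into a fault.

```python
PAGE_SIZE = 4096


class DemandPagedFile:
    """Toy model of a file mapped through an abortable area, with
    only one page of data resident at a time (the worst case)."""

    def __init__(self, data):
        self.data = bytes(data)     # stands in for the file on disc
        self.resident_page = None   # page number currently mapped in
        self.page = b""
        self.faults = 0

    def read_byte(self, offset):
        if not 0 <= offset < len(self.data):
            raise IndexError("offset outside the mapped file")
        page_no = offset // PAGE_SIZE
        if page_no != self.resident_page:
            # 'Abort': drop the old page and load the one we need.
            self.faults += 1
            start = page_no * PAGE_SIZE
            self.page = self.data[start:start + PAGE_SIZE]
            self.resident_page = page_no
        return self.page[offset % PAGE_SIZE]
```

A sequential read faults once per page, but a reader that hops between two pages faults on every access, which is the kind of pattern that made the JPEG render so slowly.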

The abort interface that had been written originally used the SWI OS_AbortTrap to perform registrations. This allows any arbitrary region of memory to be marked as abort trapping, such that any aborts within that region would call the handler. This was useful for testing purposes, but is a little generic. My original thought was that regions of logical space that were previously allocated to hardware might be trapped in this way. For example, the IOMD area could be trapped like this, and suitable operations could be performed when registers were accessed in those areas despite there not being any hardware registers there any more. However, that would mean that the area used by IOMD would have to be reserved such that no other dynamic areas were allocated within it. I am not averse to that, but I want to use Dynamic Areas for all memory regions.

The interface for the Dynamic Areas is actually just handled through the SWI OS_AbortTrap interface internally, but the client sees its Dynamic Area handler called with a new reason code to say that an abort has happened. The handler can either fulfil the request, by mapping in the memory and performing the operation itself, or do whatever is necessary to pretend that it happened. Memory never need actually exist at the abortable location - although if it doesn't then repeated operations will repeatedly call the abort handler.

In the video system's case the video area's abort handler is very similar to (but a re-implementation of) the code that was in the Kernel. The code is less efficient, because it now goes through the generalised abort handling code, and doesn't re-execute the instruction but instead emulates it. However this does give more flexibility.

As part of an experiment to see how well the interface could be used to provide emulation of hardware, I created a basic module to emulate the VS1001 MPEG audio codec. For simplicity, I set up a small region which provided the SCI (serial control interface) as one word per register, and a single register as the SPI. Data could be fed into the SPI address as bytes and would be buffered to AMPlayer through its streaming interface. The control interface could be read and written to control things like the volume, reset the player and read the position of playback.
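To give a flavour of what such a handler does, here is a toy model of a register block with no memory behind it. The register offsets, their meanings and the `on_abort` signature are all invented for this sketch; they are neither the VS1001 register map nor the real Dynamic Area handler interface.

```python
class EmulatedDecoder:
    """Toy abort handler for a register block with no memory behind it.
    Offsets and register meanings are invented for illustration."""

    REG_VOLUME = 0x00     # read/write
    REG_POSITION = 0x04   # read-only: bytes consumed so far
    REG_DATA = 0x08       # write-only: one byte of stream data

    def __init__(self, sink):
        self.sink = sink          # callable taking bytes, e.g. a player
        self.volume = 0
        self.position = 0

    def on_abort(self, offset, is_write, value=0):
        """Emulate one trapped load or store. Returns the loaded value
        for reads, or None for writes."""
        if is_write:
            if offset == self.REG_VOLUME:
                self.volume = value & 0xFF
            elif offset == self.REG_DATA:
                self.sink(bytes([value & 0xFF]))
                self.position += 1
            # Writes to any other offset are ignored.
            return None
        if offset == self.REG_VOLUME:
            return self.volume
        if offset == self.REG_POSITION:
            return self.position
        return 0                  # unmapped registers read as zero
```

Because nothing is ever mapped in, every access to the region goes through the handler, which is exactly what register emulation needs.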

It was all pretty simple to implement, and it did find some bugs in the abort implementation that were then fixed, which was useful. You probably wouldn't set up such a controller in the way that I had emulated it, but the point wasn't to emulate the hardware that could exist. The purpose (aside from testing the interface worked properly) was to prove (if only to myself) that it was possible to provide emulation of hardware mapped registers through the interface.

It might become significantly easier to provide mappings for older hardware which wouldn't be available elsewhere, such as IOMD or VIDC. Whether that was sensible to do was a different issue, but I wanted to have the ability to do so if it was.

Domains

The only other bit that was left was the handling of memory domains. These couldn't remain solely owned by the Kernel - there might be multiple video drivers that wanted to do the same tricks for acceleration purposes, and they couldn't both use the same domain. Another flag was added to the dynamic area creation to say that the area wanted to be in a domain. There were a few limitations on domains, but these would be hidden from the caller - they must be megabyte aligned allocations and must be on a megabyte boundary in logical address space. By informing the Dynamic Area of this, the constraints could be met more readily.
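The megabyte constraint comes from the hardware: on ARM the domain number is held in the first-level page table entry, and each of those covers 1MB of logical address space. The arithmetic that has to be hidden from the caller amounts to something like the sketch below; the helper names are made up for illustration.

```python
MB = 1 << 20


def round_up_to_mb(value):
    """Round an address or size up to the next 1MB boundary."""
    return (value + MB - 1) & ~(MB - 1)


def fits_domain_constraints(base, size):
    """True if a region starts on a 1MB boundary and is a whole,
    non-zero number of megabytes long."""
    return size > 0 and base % MB == 0 and size % MB == 0
```

So a request for a 300K screen in a domain area would quietly occupy a full megabyte of logical address space.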

There are, however, only 16 domains available. Whilst it might be possible to emulate more, I didn't want to consider this at this stage - the interface should allow the handling to be changed if we wanted to use emulation later. The 'Domain Dynamic Area' flag meant that a domain would be allocated for that area from a small pool and assigned to all memory allocated to that area. Domain 0 would remain the general system area, Domains 1-13 would be for general (but specific, as they are limited) use, and domains 14 and 15 would be reserved for the system. I had ideas to flag system workspace with these domains, but hadn't committed to that yet - reserving areas was easy at this stage, wouldn't hurt and would make it simpler to use them in the future, rather than worrying that there might not be enough.
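The allocation policy described here is small enough to model directly. This is a sketch of the policy only, with hypothetical names; it is not the Kernel's allocator.

```python
class DomainPool:
    """Toy allocator: domain 0 is the general system domain, 14 and 15
    are reserved, and 1-13 are handed out to dynamic areas."""

    def __init__(self):
        self.free = list(range(1, 14))
        self.owner = {}               # area name -> domain number

    def allocate(self, area):
        """Assign the lowest free domain to the named area."""
        if area in self.owner:
            raise ValueError("area already has a domain")
        if not self.free:
            raise RuntimeError("no domains left")
        self.owner[area] = self.free.pop(0)

    def release(self, area):
        """Return an area's domain to the pool."""
        self.free.append(self.owner.pop(area))
        self.free.sort()
```

With only 13 domains to hand out, the fourteenth request fails, which is why the flag is for specific uses rather than something every area should ask for.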

A new SWI OS_DynamicArea reason was created to control the domains so that these could be changed easily - admittedly they could be controlled by just changing the MMU domain register from a privileged mode, but the point was that the Kernel controlled the domains. If, in the future, we wanted to emulate more we would have that option. Also, the Kernel never actually exposes what domain has been allocated, so the client cannot directly manipulate it. So there. I'm not completely convinced that this was the right decision, but it could always have been changed in the future.

All these changes together made the video abstraction possible. It isn't something you can just drop in, as you can understand, and, more importantly, none of these interfaces are at all specific to the video system. They are vital for it to function the way that it does currently, but the functionality that had been driven by that development would be useful to other components.

There are a lot of interesting things that could come out of these changes - not least of which is virtual memory mapping on a Dynamic Area basis by using both Sparse and Aborting Dynamic Areas. But also the option to provide hardware emulation, either for hardware that doesn't yet exist, or for hardware that is in legacy products, is very useful when it is a simple job to throw together a module that handles a few hardware register mappings. OK, so actually implementing an emulation of some hardware is a bit harder, but the point is that it is easy to get the framework to do it.

Date	Wed, 20 Feb 2013