Transcript
Page 1

Hello and welcome to this Renesas Interactive module that provides an architectural overview of the RX Core.

Page 2

The purpose of this Renesas Interactive module is to introduce the RX architecture and key features of the RX core.

In this course, we’ll cover:

- the CPU core and pipeline

- how the CPU interfaces to memory and the rest of the system

- a brief overview of the RX instruction set

- an introduction to the RX’s flexible interrupt handling

- an under-the-hood look at the RX floating point unit

- and we’ll wrap up with a quick discussion of the high-speed memory built into every RX chip.

This session moves along pretty quickly and should be done in about 20 minutes.

Let’s get started.

Page 3

We’ll start with a discussion about the differences between Complex Instruction Set versus Reduced Instruction Set Computer architecture, or CISC versus RISC.

Starting with CISC, where the ultimate goal is to have a small memory footprint, here are some key attributes of a traditional CISC architecture. Generally, all instructions can access memory, many addressing modes and rich instructions add efficiency, and variable-length instructions pack code tightly into program storage space, resulting in lower requirements for program memory size. This approach does have its drawbacks, though: CISC instructions can take many clocks to execute, CISC machines are difficult to pipeline, which reduces performance, and interrupt response can be longer.

Now let’s take a look at the attributes of RISC processors, where the goal is to achieve one clock per instruction. The instructions, as you would expect, are much simpler and fewer in number on a RISC. Access to memory is only through load and store instructions. Instructions are in a fixed format, with typically only one or two instruction lengths available. This results in a larger code footprint, but it does make pipelining easier, which in turn makes it possible to achieve the goal of one clock per instruction.

Wouldn’t it be great if we could take the best attributes of CISC and combine them with the best attributes of RISC? Well, the RX does exactly that, providing the best blend of both CISC and RISC. It capitalizes on the positive features of each and rejects the negative ones.

From the CISC side, the RX adopts a rich instruction set with many addressing modes, allowing the RX instructions to directly access memory. A variable-byte-length instruction format allows for extremely compact code.

Page 4

From the RISC side, the RX adopts a uniform register set, pipelining that allows one clock per instruction performance, and fast interrupt response time.

As a bonus, RX has a floating point unit, which really helps in real-world control applications.

Let’s take a closer look inside the CPU core.


Page 5

Here we’ll examine the RX CPU core itself, the pipelined instruction path, and the operand path. At the heart of the RX600 MCU is a 100MHz, 32-bit CISC CPU core seen here, capable of 1.65 Dhrystone MIPS/MHz. The CPU has sixteen 32-bit general-purpose registers, striking the optimum balance between performance and cost.

There’s also:

- A full single precision floating point unit tightly coupled to the CPU core.

- A multiply accumulate unit producing either a 48-bit MAC result in one cycle, or an automatically repeating MAC producing an 80-bit result for very efficient DSP operations.

- A hardware multiply and divide unit.

- Fast interrupt control.

- On-chip JTAG debugger with high-speed trace.

- Memory Protection Unit.

Now let’s examine the paths between the CPU core and memory. The RX is based on an enhanced Harvard architecture, with a 64-bit wide dedicated bus for instructions, and a 32-bit wide dedicated bus for operands, or data. This is an optimal arrangement.

Page 6

The longest RX instruction is 64 bits, and the native data size is 32 bits, but options exist for 8-, 16-, and 64-bit data as well. The data can even reside on odd address boundaries to eliminate wasted space in expensive SRAM. Typically, the instruction bus is connected to Flash memory, and the data bus to SRAM. But as we’ll see later, it does not always have to be this way, because the RX has an “enhanced” Harvard architecture.

Notice that the Flash can be read at 100MHz, the same frequency as the CPU. This means the CPU will not stall while waiting to read instructions from Flash at 100MHz, even when the instruction is 64 bits wide. Also notice that the SRAM can be read, and written, at 100MHz to give the CPU full speed access for data.

Now moving to the instruction pipeline. A pipelined architecture is absolutely necessary for any high performance CPU to reach that ultimate goal of executing one instruction in just one CPU clock cycle. The RX has a 5-stage pipeline, breaking up decoded instructions into many smaller parallel tasks, reducing the amount of sequential logic. A 5-stage pipeline enables the RX CPU core to be clocked extremely fast, in fact, increasing to 200MHz in the future.

The five pipeline stages are:

- Fetch

- Decode

- Execute

- Memory access

- Register Write-back

As you can see, on each CPU clock the instruction pipeline gets progressively filled, and by the 4th clock cycle, it’s already possible to achieve one clock per instruction, or one CPI. Looking closer at one individual cycle in the pipe you can see that the CPU can command both an instruction fetch and simultaneously access data in this same CPU clock cycle.

If not for the Enhanced Harvard architecture, this concurrent activity would not be possible, causing a pipeline stall.

So far we’ve discussed CPU access to extremely fast local Flash and SRAM memory, but the RX core can access slower memories and peripherals too. In this case there is a small prefetch queue to look ahead and prefetch instructions during idle bus cycles when executing from slower memory, such as external memory. This prefetch queue is 4 deep, and each stage is 64 bits wide. This means there can be as many as thirty-two 8-bit instructions in the queue at one time. The prefetch queue helps to maintain a steady flow of instructions to the CPU, filling gaps during idle bus time while instructions are sequential, but the queue also reduces stalls when the CPU takes a branch in instruction flow. If the target branch instruction is residing in the queue, there’s no delay. If the target is not in the queue, the queue must be flushed and reloaded. The actual delay to refill the first entry into the queue depends on the access time of these slower memories.

And finally, there is a write buffer to prevent the CPU from stalling after writing to a slow external memory, or to a slow peripheral device. This buffer allows the CPU to carry on at full speed rather than waiting for the write operation to complete.


Page 7

Now let’s examine how the RX CPU core fits into the entire system.


Page 8

Here is the same RX CPU core we just discussed, with its 64-bit instruction interface, and its 32-bit data interface. We discussed the “enhanced” Harvard architecture; here is where it happens, in the bus matrix. The matrix allows either instructions or data to be accessed through any one of three different paths.

- The first is the 64-bit path to on-chip 100MHz Flash memory.

- The second is a 64-bit path to on-chip 100MHz SRAM.

- The third is access to a high-speed, 32-bit wide, 50MHz internal bus, named Internal Main Bus 1.

The CPU is bus master to the bus matrix, with Flash, SRAM, and Internal Bus 1 acting as slaves.

Internal Bus 1 also gives the CPU access to the external bus pins through the Bus State Controller, or BSC, as well as to the on-chip peripherals through a Bus Bridge. We’ll see the connection to on-chip peripherals in a moment. Looking back at the bus matrix, you can see here that the CPU can fetch instructions from any of the 3 slaves while simultaneously accessing data (or operands) from any of the 3 slaves. This is “enhanced” Harvard architecture, meaning the CPU can execute code from SRAM, and access data from Flash if desired. This option allows very flexible operation, such as accessing data tables in Flash, or downloading code into SRAM and executing it.
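To make the first of those options concrete, here is a minimal C sketch with a hypothetical table and function name (not taken from the module): a const table that the toolchain typically places in Flash, read as data while instructions continue to stream over the instruction path.

```c
/* Minimal sketch (names and values are illustrative, not from the module):
 * a const table is typically placed in Flash by the toolchain, and the CPU
 * reads it over the operand path of the bus matrix while instruction
 * fetches continue over the instruction path. */
static const unsigned short sine_quarter[4] = { 0u, 804u, 1608u, 2410u };

unsigned short table_lookup(unsigned int index)
{
    /* The instruction fetch for this routine and the data read of the
     * table can take different paths through the bus matrix. */
    return sine_quarter[index & 3u];
}
```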

In summary, the 100MHz on-chip memory means the RX CPU can run at full speed with no delays from on-chip memory, the prefetch queue minimizes delays when the CPU accesses slower memories, and the RX CPU is the sole master of Internal Main Bus 1. Next, we’ll look at Internal Main Bus 2.


Page 9

Here is Internal Main Bus 2. Three bus masters can arbitrate for bus ownership; they are the Ethernet DMA controller, the general DMA controller, and the Data Transfer Controller. Each of these controllers offloads the CPU while moving data from peripheral to peripheral, peripheral to memory, and memory to memory. Internal Main Bus 2 can also access the external bus pins through the BSC, as well as accessing peripherals through a bus bridge.

The RX has a unique feature in the External DMA controller, or EXDMA. This DMA controller can take possession of the BSC and the external CPU bus pins to orchestrate the movement of data from one external device to another external device, with the data never entering the RX MCU. This is very efficient in that the loading on the CPU is minimal even when high-bandwidth data transfers are being conducted outside the chip. For example, EXDMA easily supports moving RGB image data from an external frame buffer RAM to an external TFT-LCD panel, as in our Direct Drive solution.

There are as many as six individual internal peripheral busses in some RX devices, grouping on-chip peripherals to optimize data flow. For example, the USB interface resides on its own peripheral bus to minimize interference from slower peripheral devices. Here you can see the large number of functions on the RX for connectivity, analog, timers, and system control. The blocks of peripherals shown here are grouped logically to simplify this drawing, but do not necessarily reflect how they are physically grouped on the device itself. Now let’s look at an example of how this bus structure can be exploited for maximum throughput.

Here you see that during one clock cycle the CPU can fetch instructions from Flash memory while at the same time the CPU can be writing data to a USB or serial channel to send data to another device. At the same time, the Ethernet DMA controller is moving data packets out on the Ethernet bus from SRAM while the EXDMA is moving data on the external data bus from one external device to another. This could be a graphic frame buffer DRAM and a color TFT-LCD panel. Notice that all four of these transfers are happening simultaneously, each one on separate physical busses with no interference to each other. But it’s still possible to move even more data while this is occurring.

The blue arrow shows that the general DMA controller can move data from a peripheral, like the ADC, into SRAM, by arbitrating access to SRAM with the Ethernet DMA controller. There is plenty of bandwidth on Internal Main Bus 2 because it’s 32 bits wide and operates at 50MHz. This interleaved access of SRAM has very little impact on either the Ethernet or the ADC transfers.

And finally, the yellow arrow shows how the Data Transfer Controller can move data from a timer over to a DAC output, by using the peripheral busses. Again, there is minor arbitration needed on the Peripheral bus, but the bandwidth is more than sufficient to minimize interference. So in the end, here are 4 completely independent high-speed transfers occurring, plus 2 more interleaved transfers, showing the use of all five bus masters in the chip: the CPU, Ethernet DMAC, DMAC, DTC, and EXDMA.


Page 10

Now let’s talk about the RX’s instruction set, and as we do, put yourself in the role of the engineer tasked with designing the instruction set for the RX. Your job is to improve code density, speeding throughput in all aspects of instruction handling and allowing the use of smaller memory devices or the ability to add more features in the same memory. And you want to support modern high-level languages and make it easy for compiler writers to create efficient optimizers. So, how would you do this?

No doubt, first you’d try and get your hands on some real-world application code. You’d look at it to see the kinds of applications your customers are writing, and what kinds of instructions and memory addressing they typically use. And that’s just what Renesas engineers did in designing the RX’s instruction set: they gathered real code from over 32 customers across a spectrum of industries and analyzed it. This led them to adopt a variable byte-length instruction format to help minimize the code footprint. Once you did that, you’d look at histograms of instruction usage, take the most commonly used instructions, and assign them the shortest instruction codes. Then you’d add flexible addressing to fully support the bus architecture we saw earlier.

So here’s the resultant RX instruction set. It includes standard arithmetic & logic instructions, instructions for data transfer, branching, bit manipulation, system control, plus special instructions to support floating-point, DSP operations and even string operations. 61% of these instructions have a single-cycle version, as well as longer versions that allow more flexible addressing. Let’s take a detailed look at how one particular instruction is highly optimized in the RX. As we mentioned, Renesas engineers analyzed code from real-world applications and found that the single most used instruction is the MOV instruction. This instruction accounts for 31% of all instructions in a typical embedded application.


Page 11

Let’s see how MOV is implemented in the RX.


Page 12

Here you see listed for the MOV instruction its function, or type of move, and the source and destination for the move. In this case we’re moving a 32-bit immediate value into one of the RX general registers. I’ll also show you the length of the various forms of the MOV instruction.

OK, the first form is 6 bytes long, and of course 4 of these bytes hold the immediate data. Next are the forms that move an immediate value out to memory (Flash or SRAM). In this case the compiler has many forms to choose from in terms of instruction length and address mode to do just what’s needed, nothing more, for optimum performance and memory footprint. For example, if just 1 byte of immediate data needs to be moved to a register, then why waste a longer instruction when a 3-byte form will do it? And finally, here are the remaining forms of MOV, all very compact at only 2 bytes each, and very powerful.

Let’s take the last one for example, moving a 32-bit data item from one memory location to another memory location, with the 32-bit addresses of each of these locations stored in general registers as shown. This powerful memory-to-memory move occupies only 2 bytes of code space.
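As a quick illustration of that last form, a plain pointer-to-pointer copy in C is all the source code it takes; the function below is a hypothetical sketch, not taken from the module, and the compiler is free to pick the compact register-indirect MOV when both addresses are already in general registers.

```c
/* Hypothetical sketch: a 32-bit copy from one memory location to another
 * through two pointers. With the source and destination addresses held in
 * general registers, the compiler can emit the compact register-indirect
 * MOV form described above. */
void copy_word(unsigned long *dst, const unsigned long *src)
{
    *dst = *src;
}
```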


Page 13

So, does all this optimizing of the RX instruction set pay off? Yes!

Five different embedded applications were benchmarked by Renesas using the RX and a Cortex-M3 based MCU. For all benchmarks, the compilers were set to optimize code size for the smallest memory footprint. The RX compiler produced code that was up to 28% smaller than that of the Cortex-M3 based MCU. You can see that a lot of precious program storage is wasted in the M3-based MCU by its fixed instruction lengths, versus the variable-byte-length instructions of the RX.


Page 14

Let’s look now at interrupt processing on the RX.

Interrupts are used in embedded systems to react to time-sensitive events, and the RX provides a number of ways to optimize response to interrupts. Here’s how the RX handles a normal interrupt.

Once the interrupt fires, the CPU automatically resolves the interrupt source and selects the correct vector, pushes the Program Status Word and Program Counter onto the stack, modifies the PSW to reflect the current interrupt state, and then starts execution of the user’s Interrupt Service Routine, or ISR. This hardware-automated portion of the process typically takes 7 clock cycles, at which point we’re in our ISR. The first thing we need to do is save whatever registers we’ll be using onto the stack so that we can restore them on the way out of the ISR. Then we have our actual ISR processing, followed by the restoration of those registers we saved on the way in. This completes software interrupt processing, and as the ISR is exited with the RTE instruction, hardware processing continues with the hardware restoration of the PC and the PSW from the stack. This portion takes 6 clocks. This is pretty efficient, but what if we could cut out the pushing and popping of the PSW and PC?

Well, we can with the RX Fast Interrupt. You can specify one interrupt source as the Fast Interrupt. The Fast Interrupt differs from other interrupts in that the PC and PSW, instead of being stored on the stack, are stored in dedicated backup registers which are much faster to access. So the hardware portion of the context save and restore is sped up.

Here’s what it looks like… When the interrupt fires, the PC and PSW are now stored in the backup registers. This saves 2 clocks on entry. As before, software saves additional registers on entry to the ISR, executes the ISR code, and then restores registers on exit. Once the Return From Exception instruction is executed, the hardware portion of the context restore is shortened by 3 clocks over a normal interrupt, as the PC and PSW are restored from the backup registers rather than the stack. Using the fast interrupt, we’ve saved 5 clocks over a standard interrupt. This is a nice improvement, but you can see there is still some housekeeping going on in software as we enter the ISR. Software has to save to the stack the registers it will use, and then pop them off on the way out. Wouldn’t it be great if we could dedicate a few of the general-purpose registers for use by the ISR to eliminate this overhead? The RX compiler allows you to set aside up to four registers for use only by the fast interrupt routine.

Let’s see how the ISR looks now… We still have the same hardware entry, typically 5 clocks. But now there is no need to push or pop registers: the ISR starts running immediately, with no saving of context on entry and no restoring of context on exit. And the hardware side of the return remains at 3 clocks. By setting aside a small block of registers, you can save many clocks over a standard interrupt. And while your mainline code may experience a small decrease in performance as a result of running on a smaller register set in this mode, with the RX you have the choice of how you’d like to optimize performance for your application.
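For orientation, here is a minimal C sketch of what such a handler can look like. The pragma spelling, the fast-interrupt option, and the vector number are assumptions about a typical RX toolchain rather than details from the module, so treat this as an illustration and check your compiler manual for the exact syntax.

```c
/* Hedged sketch only: the pragma syntax, the fast-interrupt option, and
 * the vector number below are assumptions, not taken from the module. */

volatile unsigned long tick_count;

/* Assumed syntax: designate this handler as the single fast interrupt. */
#pragma interrupt (timer_tick(vect = 28, fint))
void timer_tick(void)
{
    /* If the build also reserves a few general registers for the fast
     * interrupt (a compiler option, assumed here), no push/pop sequence
     * is emitted and the handler body runs immediately. */
    tick_count++;
}
```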


Page 15

We mentioned earlier that the RX provides dedicated hardware for floating point math. Let’s take a look at the RX Floating Point Unit, or FPU, now. The RX FPU supports IEEE 754 single precision floating point. It operates directly on registers or memory. In less sophisticated devices you have to copy your operands out to the FPU, issue a command, wait for the results, and then fetch the results. With the RX’s FPU you don’t need to do that – it is very efficient. For example, we can multiply the value in register R4 by a floating point value and store the result back in R4. The FPU can fetch the operands directly from the CPU register and from Flash, and write the result back to a register. All IEEE 754 exceptions are handled, and a full complement of floating point instructions with appropriate addressing modes is included.

So is it fast? Yes – a basic floating point add on two registers completes in just 3 clocks, or 30 ns at 100 MHz. What does this mean for real-world applications? It opens up tremendous opportunity to do filtering, signal processing, and other tasks that would formerly have required a low-end DSP. As a real-world example, an 8-tap FIR completes in under a microsecond.
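To picture the kind of code behind that FIR figure, here is a minimal single-precision sketch; the coefficient values, buffer sizes, and names are assumptions for illustration, not the filter Renesas benchmarked.

```c
#define NUM_TAPS 8

/* Illustrative 8-tap FIR in single precision. Coefficients and names are
 * assumed for this sketch, not taken from the benchmark. */
static const float coeff[NUM_TAPS] = {
    0.02f, 0.08f, 0.18f, 0.22f, 0.22f, 0.18f, 0.08f, 0.02f
};
static float delay_line[NUM_TAPS];

/* Compute one output sample: eight single-precision multiply-accumulates,
 * which the RX FPU performs directly on registers and memory operands. */
float fir_step(float new_sample)
{
    float acc = 0.0f;

    for (int i = NUM_TAPS - 1; i > 0; i--) {
        delay_line[i] = delay_line[i - 1];   /* shift the delay line */
        acc += coeff[i] * delay_line[i];
    }
    delay_line[0] = new_sample;
    acc += coeff[0] * new_sample;

    return acc;
}
```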


Page 16

Let’s take a look under the hood and see how the compiler and the RX’s FPU work together to make your application run faster.

Here’s a typical real-world math problem: converting temperature from degrees C to degrees F. This is a deliberately simple example, but even here you’ll appreciate how much overhead the FPU eliminates.

Here’s some C code to do the conversion – it’s written using single precision math, so it’s nice and readable.

Let’s break it down:

- First we have some single precision variables

- Floating point constants

- Three different floating point operations
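The C source itself isn’t reproduced in this transcript, so the snippet below is a plausible reconstruction of what the slide describes, with assumed variable names: single-precision variables, single-precision constants, and three floating point operations.

```c
/* Reconstruction of the slide's example (variable names assumed):
 * single-precision Celsius-to-Fahrenheit conversion. */
float deg_c;
float deg_f;

void convert(void)
{
    /* Three floating point operations: multiply, divide, add. The 'f'
     * suffixes keep every constant single precision so the compiler can
     * use FPU instructions instead of double-precision emulation calls. */
    deg_f = deg_c * 9.0f / 5.0f + 32.0f;
}
```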


Page 17

Let’s take a look at the code emitted by the compiler. Surprisingly, only five instructions are needed. We see our constants are stored in IEEE 754 format. We can see that we have RX floating point instructions and that these instructions operate directly on memory and registers.


Page 18

To really appreciate the efficiency of the FPU, let’s look at what happens if we tell the compiler to turn off FPU code generation and instead use the floating point emulation software library.

We’ll look at just one floating point instruction – the floating point divide, or FDIV. With FPU support turned off, the compiler instead makes a call into a library function to do the math. You can see it’s quite a few instructions…and some more instructions…and some more instructions…and some more…

Imagine how much extra time all of that takes! What if you’re in a loop? It takes over 100 instructions to do the same work as one instruction when using the FPU!


Page 19

Next, we’ll look at the effect of the native Flash operating speed on CPU performance.

The RX family uses Renesas’s 90nm MONOS flash technology, which has the fastest access time of any embedded Flash memory in the MCU industry, at just a 10ns read time, or 100MHz. Because the Flash memory can provide instructions to the RX CPU at the same rate the CPU consumes them, there’s no need for memory acceleration techniques.

If we examine a pipelined processor running at 30 MHz coupled to a Flash memory whose native speed is also 30 MHz, there are no pipeline stalls, as shown here, and performance is linear with clock speed. However, once the speed of a CPU with slower Flash rises above 30 MHz, one wait state must be added after the instruction fetch stage to wait on the Flash, and again at every multiple of the native Flash speed, each time reducing overall CPU performance, as you can see on the graph. The RX, however, can continue all the way up to 100MHz with no wait states, as shown.


Page 20

So how do all these features work together? What are some real measurements we can look at to objectively determine the performance of the RX?

Dhrystone is a benchmark that's been around for years. And while it isn't the best measure of all the things that make an embedded system perform, it is a good baseline.

The RX clocks in at a very impressive 1.65 Dhrystone MIPS/MHz. Because of the zero-wait-state, 10ns Flash, this performance scales linearly right up to 100 MHz, delivering an impressive 165 DMIPS.

A new suite of benchmarks more geared towards embedded systems has been crafted by EEMBC. Their base benchmark is CoreMark. The RX scores very well here, too, with a CoreMark of 224.74 and a CoreMark/MHz score of 2.24 (again, performance scales linearly on the RX).

Finally, Renesas has performed some internal benchmarks on the floating point unit in the RX to measure its suitability for DSP-like applications such as digital filtering. The RX can perform an 8-tap floating-point Finite Impulse Response filter in under 1µs. This level of performance allows the RX to be used in applications that were previously the domain of DSPs.


Page 21

Let’s summarize what we’ve seen:

- In this module you’ve learned about the RX CPU core and pipeline,

- The interfaces used by the core to talk to memory and peripherals,

- The unique instruction set of the RX that merges the best of CISC & RISC,

- The flexible interrupt handling that allows you to craft low-latency service routines,

- The single-precision floating point unit that gives you number-crunching power in a microcontroller,

- And the high-speed memory that makes sure all of these features flow through the chip at full speed.

Thanks for watching!


Page 23

Thank You
