White Paper – Renesas R-IN32M3 Industrial Network ASSP
HW-RTOS
Improved RTOS Performance
by Implementation in Silicon
Author: Carl Stenquist
Renesas Electronics America Inc.
May 2014
Abstract
A Real Time Operating System (RTOS) is an integral part of an embedded system as applications
are becoming more complex. According to one embedded market study [1], the use of an RTOS or
scheduler is required in more than 68% of applications. The problem with many software RTOSes
is that they are inherently dependent on the processor’s performance and load. Finding ways to
optimize the performance of the RTOS within each CPU architecture is therefore key.
This paper describes the performance improvement from an RTOS accelerator implemented in
silicon. The Renesas Industrial Network ASSP (R-IN32M3) embeds a “Real-Time OS Accelerator”
(HW-RTOS) block that executes common RTOS system calls in hardware, including task scheduling
and prioritization as well as semaphore and mailbox operations. Benchmarks show
that context switching can execute 2 to 3 times faster than typical SW-RTOS operation at the same
CPU clock speed, with significantly less jitter.
I. Introduction
An RTOS enables a system to conveniently be divided into subtasks (processes) with clear interfaces between them. The subsystems, that is the RTOS tasks, can then be designed independently.
The tasks will then communicate with each other through message queues, semaphores, flags etc.
provided by the RTOS services. An RTOS also provides means to easily schedule tasks to make
sure time deadlines are met.
A typical software RTOS is a kernel library that manages all this. Its algorithms will optimize the
operation for task priority level, and distribute access to limited hardware resources. See Figure 1.
In a preemptive RTOS, a CPU timer ‘tick’ wakes the kernel at a regular interval to determine if it is
time to switch the running task.
This mode of operation is adequate for most applications, since the processor has many other
subsystem checks and operations to manage. In industrial networking applications, however, it is
necessary to support the real-time behavior of protocols such as EtherCAT, EtherNet/IP, or
PROFINET IO. Adding a traditional RTOS could reduce the overall speed of the system, and may add
jitter.
[Figure 1 diagram: traditional SW-RTOS. Application (software): assign OS resource, system call is
made. SW RTOS library: OS resource secured, system call processing, dispatch, task scheduling, OS
resource management, tick processing. Hardware: timer (tick count).]
Figure 1: Services that an RTOS provides to an application.
Benefits of hardware accelerated RTOS
In this paper we look at how implementing the RTOS in silicon can lessen administrative CPU
overhead. There is no “timer tick” interrupt to determine whether it is time to preempt the
currently running task, since this is taken care of by a timer in the HW-RTOS block. Since this is done
in hardware, there is also inherently less execution jitter. Jitter is caused by variation in the time
RTOS functions take to run, which depends on the system state, the number of tasks, the resources
in use, and so on, all managed by the CPU.
II. Software RTOS vs. Hardware RTOS
Although the RTOS manages parallel tasks, the actual task sequence is based on how the RTOS
manages the CPU. Figure 2 below shows a typical RTOS operation. As Task A is running, an
interrupt from a peripheral triggers the RTOS to execute a glue routine that initiates the interrupt
handler. As part of the interrupt handler, a system call is made based on what the operation
requires. If another task (Task B) with higher priority needs to run, the interrupt handler exits
and the RTOS dispatches Task B.
[Figure 2 diagram: timeline across the hardware, CPU interrupt handler, RTOS, and user task layers.
An interrupt from a peripheral arrives while Task A runs; the RTOS glue routine starts the interrupt
handler, which makes a system call; the RTOS then dispatches Task B. Annotation: "Jitter is
introduced due to indeterminate CPU loading."]
Figure 2. An RTOS can also manage interrupt routines. When such an interrupt service routine
(ISR) is finished, the RTOS determines whether a different task should run instead of the latest
one. This is called preemption. (Preemption can also occur at a timer tick/timeout.)
However, even with a system that uses a modern, fast MCU and SW RTOS, there may be several
tasks with conflicting hard deadlines that must be met. This can cause uncertainty. How long will
it take for my data input to be read, processed, and for a new output value to be set? How do I
calculate the worst case?
In addition, if the timer tick is set to a high rate to avoid missing a deadline, the constant
interruption by the scheduler will itself use up more precious CPU time. Since
scheduling takes place in hardware for HW-RTOS (context switching is still done in software),
total context switch time will be reduced.
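As a rough illustration of the tick cost described above, the share of CPU time spent on tick processing can be estimated as the tick rate times the cost of one tick handler run, divided by the CPU clock. The sketch below is only a back-of-the-envelope calculation: the function name and the 500-cycle handler cost used in the note after it are assumptions for illustration, not measurements from this paper.

```c
#include <stdint.h>

/* Estimate the share of CPU time, in parts per million, consumed by a
 * periodic tick handler. All inputs are illustrative assumptions.
 * cpu_hz must be nonzero. */
uint32_t tick_overhead_ppm(uint32_t cpu_hz, uint32_t tick_hz,
                           uint32_t handler_cycles)
{
    /* Cycles spent in the tick handler during one second. */
    uint64_t busy_cycles_per_s = (uint64_t)tick_hz * handler_cycles;

    /* Scale to parts per million of the available cycles. */
    return (uint32_t)((busy_cycles_per_s * 1000000u) / cpu_hz);
}
```

With a 100 MHz clock and the assumed 500-cycle handler, a 1 kHz tick gives 5000 ppm (0.5%), while a 10 kHz tick gives 50000 ppm (5%), which shows how raising the tick rate to tighten deadlines eats into CPU time.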
Reduced Jitter
Even if a worst case timing is determined, when a system later grows, the deadline may no longer
be met. An engineer who makes a change to the system may not be aware that some other deadline,
outside of their work assignment, may be violated. This is in large part due to jitter.
With some of an RTOS’s functionality, such as time management and task scheduling, handled in
hard logic and not by CPU instructions, variation of RTOS function execution times due to system
state (number of tasks and other OS resources) will diminish.
This is for two reasons. First, the execution time needed to read and execute code by a CPU is
greater than the time it takes for hard wired logic gates in silicon to run to completion. Secondly,
the execution time of a SW-RTOS will vary with the number of tasks, semaphores, flags, queue size,
etc. The time taken will depend on the system state at the time of the timer tick, and this state may
be very complicated in a non-trivial system.
To summarize, it may be difficult to work out exactly what the maximum time is for the RTOS to
execute a task. However, with HW-RTOS, since scheduling and system resources are managed in
hardware and therefore executed in parallel with the CPU, and also executed faster than the CPU
can, this uncertainty will diminish.
Task timing and OS tick offload
In a conventional preemptive RTOS an OS “timer tick” regularly interrupts the current task to
check whether a higher priority task is ready to run. As a part of this system interrupt the kernel
checks if any task has asked for a timeout and is therefore a candidate to run. This tick processing,
even if there is no rescheduling, will use up precious CPU time.
In the HW-RTOS, there is no tick interrupt. Timing management is taken care of by the HW-RTOS
block in hardware. For HW-RTOS, a running task can be switched:
• When its internal clock, the OS reference timer, causes preemption for a timeout previously
called for by a task.
• When a call is made to the RTOS.
• When an interrupt occurs.
Suppose that at a given moment only one task has made a call to the OS that is pending with a
timeout. For example, one task is waiting for a flag to be set, but with a timeout. If the timeout expires,
HW-RTOS will then preempt the running task and reschedule to the task with the highest priority.
In the meantime, no time is lost running a timer tick handler and executing the kernel only to find
that there are no timeouts pending. In addition to speeding things up, this also helps reduce
jitter. For an illustration of scheduling and task switching, see Figure 6.
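The single-timeout example above can be put into numbers with a small sketch. It compares how often the CPU is interrupted over a time window when a periodic tick is used and when the CPU is interrupted only for timeouts that actually expire. The function names, the 1 ms tick period, and the 250 ms timeout in the note after it are assumptions for illustration; this models the idea only, not the HW-RTOS timer logic itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of tick interrupts a tick-driven kernel services in a window.
 * tick_period_ms must be nonzero. */
uint32_t tick_wakeups(uint32_t window_ms, uint32_t tick_period_ms)
{
    return window_ms / tick_period_ms;
}

/* Number of CPU interruptions when only expiring timeouts interrupt the
 * CPU. Deadlines are in ms, measured from the start of the window. */
uint32_t timeout_wakeups(const uint32_t *deadlines_ms, size_t count,
                         uint32_t window_ms)
{
    uint32_t wakeups = 0;

    for (size_t i = 0; i < count; i++) {
        if (deadlines_ms[i] <= window_ms) {
            wakeups++;   /* this timeout expires inside the window */
        }
    }
    return wakeups;
}
```

For a one-second window with one pending 250 ms timeout, the tick-driven model interrupts the CPU 1000 times at a 1 ms tick, while the timeout-driven model interrupts it once.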
III. HW-RTOS on R-IN32
The R-IN32M3 is an industrial network ASSP that contains a combination of peripherals and
hardware IP blocks that accelerate the processing of Ethernet communication, while being able to
manage RTOS operation for complex industrial applications.
The device includes either an EtherCAT slave controller with an integrated PHY, or a CC-Link IE
slave that supports Gigabit Ethernet performance. There is an SRAM interface, which can be used as
a high speed slave port when connecting to a host CPU. In addition, the R-IN32M3 has an ARM
Cortex-M3 32-bit RISC CPU core, with an integrated dual 10/100 MAC, a hardware three-port switch,
dual Ethernet PHY (-EC version), a dedicated DMA controller, and a separate buffer area for the
network processor.
[Block diagram of the R-IN32M3-EC: Cortex-M3 CPU core at 100 MHz; Hardware Real-time OS; internal
RAM with ECC (instruction 768 KB, data 512 KB, buffer 64 KB); Ethernet accelerator (checksum/header
ENDEC, buffer allocator/buffer manager); EtherCAT slave controller; Ether MAC; 2-port switch;
2-port Ethernet PHY (100 Tx/Rx); CC-Link; real-time port; DMAC 1 ch; general DMAC 4 ch; serial
flash ROM I/F; SRAM I/F or host CPU I/F; CAN 2 ch; UART 2 ch; CSI 2 ch; I2C 2 ch; 4-ch timer array;
watchdog timer; general port.]
In conjunction with the HW-RTOS, the Ethernet Accelerator on the R-IN32M3 will help to achieve
more deterministic communication (less jitter and higher speed).
SW-RTOS functions done in hardware
Figure 1 shows a typical SW-RTOS, where resource management, queuing, and task scheduling are
done in the SW-RTOS kernel. By comparison, the HW-RTOS on the R-IN32M3 executes the system calls
in hardware, including the scheduling and tick processing. The advantage is that
you can make the same system calls through a standard SW-RTOS API, while some functions
are accelerated within the HW-RTOS block.
[Figure 3 diagram: two stacks side by side. Traditional SW-RTOS: the application (software) assigns
OS resources and makes system calls; the SW RTOS library handles OS resource securing, system call
processing, dispatch, task scheduling, OS resource management, and tick processing; the hardware
provides the timer (tick count). RTOS accelerator in HW: the application assigns OS resources and
makes system calls; the software library retains OS resource securing, system call processing, and
dispatch; the HW-RTOS block in hardware performs system call execution, task scheduling, OS
resource management, and tick processing.]
Figure 3. Functional diagram of how functionality has moved from software to hardware.
HW-RTOS blocks
Figure 4 shows the HW-RTOS scheduler block and OS resources, the CPU, the instruction/data bus,
and how interrupts are routed.
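[Figure 4 diagram: the CPU (Cortex-M3 on R-IN32M3) and the HW-RTOS block, connected to the bus (AHB
on R-IN32M3) through a bus interface (AHB bridge on R-IN32M3). The HW-RTOS block contains the
Hardware ISR, the task scheduler, the system timer, the system call registers, and the OS resources
(task, event, semaphore, mailbox). Interrupts enter the HW-RTOS block, which issues one interrupt
(x1) to the CPU. Data RAM holds the current task, SP_table[], and the stacks for Task 1 to Task n.
Instruction memory holds the OS library.]
Figure 4. Block diagram of HW-RTOS in the R-IN32M3.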
“Hardware ISR”, the top yellow box in the HW-RTOS block, takes care of interrupts, except for “x1”,
which is issued to the CPU to call the HW-RTOS driver library (bottom of the figure).
The HW-RTOS provides semaphores, mailboxes, flags and mutexes. It has OS management calls
to put a task to sleep, rotate task precedence, disable OS dispatching, etc.
The HW-RTOS has a “hard” interrupt mechanism where preregistered service calls can be automatically run when a particular interrupt occurs. These automatic interrupt service calls can be
semaphore or flag signaling, or to wake up a task. No software is involved.
Tasks, semaphores, flags, mutexes, mailboxes, etc. can be created statically at compile time, or
dynamically at runtime as appropriate.
Figure 5 below shows the number of resource management and communication objects available for
HW-RTOS on the R-IN32M3.
HW-RTOS on R-IN32M3
Resource                                                  Quantity
Total number of contexts that can be handled              64
Number of context priorities                              16
Number of semaphores (binary or counting) and mutexes     Total 128
Number of events                                          64
Number of mailboxes                                       64
Number of mailbox messages                                192
HW-ISRs                                                   Max 32, selectable from 128 QINTs
Figure 5: Table of HW-RTOS resources
API
The HW-RTOS is written with both uITRON RTOS standard API and uC/OS-III HW-RTOS API as
templates. There are some 30 system calls for resources such as event flags, semaphores, and
mailboxes.
Priorities
The task with the highest priority (lowest number) is run when the scheduler is invoked. Several
tasks may have the same priority. In that case tasks are scheduled by an FCFS (First-Come,
First-Served) mechanism that can manage up to 32 tasks.
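The selection rule just described (lowest priority number wins, ties resolved first-come, first-served) can be sketched in software as follows. This is only a model of the rule: the `task_t` layout, the `ready_seq` field, and the function name are assumptions for illustration, and the HW-RTOS performs this selection in hardware rather than by a loop.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  id;         /* task identifier */
    uint8_t  priority;   /* lower number = higher priority */
    uint32_t ready_seq;  /* order in which the task became ready */
    int      ready;      /* nonzero if the task is ready to run */
} task_t;

/* Return the index of the task to run next, or -1 if none is ready. */
int pick_next_task(const task_t *tasks, size_t count)
{
    int best = -1;

    for (size_t i = 0; i < count; i++) {
        if (!tasks[i].ready) {
            continue;
        }
        if (best < 0 ||
            tasks[i].priority < tasks[best].priority ||
            (tasks[i].priority == tasks[best].priority &&
             tasks[i].ready_seq < tasks[best].ready_seq)) {
            best = (int)i;   /* more urgent, or same priority but earlier */
        }
    }
    return best;
}
```

Among two ready tasks of equal priority, the one with the smaller `ready_seq` (the one that became ready first) is chosen.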
Task scheduling
There is no need for preemptive scheduling using a “timer tick”. This is because tasks may be
scheduled to run by any of the following causes:
1. At a specific time, that is, at a certain system clock value. For example, receiving from a
mailbox with a timeout, or locking a mutex with a timeout.
2. A HW-RTOS (“system”) call is made from a task, at which time the kernel sees that another task
has a higher priority.
3. When an interrupt occurs:
   a. A system call can be made from an interrupt without any software being involved. This
   feature must be set up at compile time in the “Hardware ISR” table.
   An entry causes a certain interrupt to make a flag or semaphore call.
   b. A system call is made from a SW ISR.
Figure 6 shows how HW-RTOS determines execution flow when a non-interrupt system call is
made. The actual context switch is done by a driver library.
[Figure 6 diagram: flow between Task A, the driver, HW-RTOS, and Task B. Task A makes a system
call; the driver sets the HW-RTOS registers (converting the request to a HW setting); the HW-RTOS
system call operation picks the next task and returns a result (error code and next task ID). If no
context switch is needed, the driver returns to Task A. If one is needed, the driver saves the
current context (register set), changes context (stack pointer), restores the next context
(register set), and runs Task B.]
Figure 6. Non-interrupt system call execution flow. HW-RTOS determines
what task to execute and a driver does the context switch.
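The driver side of the flow in Figure 6 can be modeled in a few lines. This is a simplified sketch, not the actual driver library: the result structure, the function name, and the use of a plain variable in place of the CPU stack pointer are assumptions, and the register save and restore that a real context switch performs in assembly is reduced here to swapping stack pointers through an `sp_table[]` like the one shown in Figure 4.

```c
#include <stdint.h>

#define MAX_TASKS 64   /* context count listed in Figure 5 */

/* Result read back by the driver after a system call (assumed layout). */
typedef struct {
    int     error_code;    /* 0 means success in this sketch */
    uint8_t next_task_id;  /* task selected by the scheduler */
} syscall_result_t;

static uint32_t sp_table[MAX_TASKS];  /* saved stack pointer per task */
static uint8_t  current_task;         /* ID of the running task */

/* Apply a system call result. Returns 1 if a context switch was made.
 * cpu_sp stands in for the CPU stack pointer register. */
int driver_apply_result(syscall_result_t res, uint32_t *cpu_sp)
{
    if (res.error_code != 0 || res.next_task_id >= MAX_TASKS ||
        res.next_task_id == current_task) {
        return 0;                          /* return to the calling task */
    }
    sp_table[current_task] = *cpu_sp;      /* save current context */
    current_task = res.next_task_id;       /* change context */
    *cpu_sp = sp_table[current_task];      /* restore next context */
    return 1;
}
```

The point of the split is that the decision (which task runs next) arrives ready-made from the hardware, so the software path contains only the unavoidable save and restore work.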
Queues
There are no queuing services in the HW-RTOS API; instead, mailbox services with flexible priority
schemes are incorporated. Each mailbox is consumed either in FIFO order, in task priority order,
or in message priority order. Each message contains the mailbox ID and a pointer to the actual
message data.
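To make the ordering schemes concrete, here is a minimal software model of a mailbox that holds messages in either FIFO or message priority order. It is a sketch under assumptions: the type and function names, the fixed capacity, and the convention that a lower number means a more urgent message are all illustrative, and the task priority ordering (which concerns the order of waiting receivers) is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

#define MBX_CAPACITY 8   /* arbitrary size for this sketch */

typedef enum { MBX_FIFO, MBX_MSG_PRIORITY } mbx_order_t;

typedef struct {
    uint8_t     mailbox_id;  /* ID of the mailbox the message belongs to */
    uint8_t     priority;    /* lower number = more urgent (assumed) */
    const void *data;        /* pointer to the actual message data */
} mbx_msg_t;

typedef struct {
    mbx_order_t order;
    size_t      count;
    mbx_msg_t   slots[MBX_CAPACITY];
} mailbox_t;

/* Queue a message; returns 0 on success, -1 if the mailbox is full. */
int mbx_send(mailbox_t *mbx, mbx_msg_t msg)
{
    if (mbx->count >= MBX_CAPACITY) {
        return -1;
    }
    size_t pos = mbx->count;
    if (mbx->order == MBX_MSG_PRIORITY) {
        /* Shift less urgent messages back; equal priorities stay FIFO. */
        while (pos > 0 && mbx->slots[pos - 1].priority > msg.priority) {
            mbx->slots[pos] = mbx->slots[pos - 1];
            pos--;
        }
    }
    mbx->slots[pos] = msg;
    mbx->count++;
    return 0;
}

/* Take the message at the head; returns 0 on success, -1 if empty. */
int mbx_receive(mailbox_t *mbx, mbx_msg_t *out)
{
    if (mbx->count == 0) {
        return -1;
    }
    *out = mbx->slots[0];
    for (size_t i = 1; i < mbx->count; i++) {
        mbx->slots[i - 1] = mbx->slots[i];
    }
    mbx->count--;
    return 0;
}
```

Because only a pointer travels through the mailbox, the message data itself is never copied, which keeps the cost of a send independent of the message size.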
Interrupts, the HW-ISR
When a task timeout expires and HW-RTOS determines it is time to reschedule, it raises an
interrupt that is reserved for this purpose in the ARM core. The CPU services this interrupt and
relays execution to the task selected by HW-RTOS.
To increase speed, instead of using a software ISR, a user can preconfigure the HW-ISR table to
perform certain services: set a flag, post to a semaphore, or wake up a task. A software ISR
routine is in that case not even necessary. This is illustrated in Figure 7.
[Figure 7 diagram: SW-RTOS case versus HW-RTOS case. In both cases Task B blocks in wait_semaphore
while Task A runs. SW-RTOS case: the interrupt runs an ISR dedicated to signaling the semaphore;
the ISR calls sig_sem(), Task B wakes up, and the RTOS runs the interrupt to change task. HW-RTOS
case: the interrupt is used to signal the semaphore directly; sig_sem() is called by HW-RTOS and
Task B wakes up. The ISR is not necessary, saving time.]
Figure 7. If the static HW-ISR table is prepared at compile time,
interrupt service routines in software can be omitted.
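A static HW-ISR table of the kind described above might be modeled like this. The entry layout, the names, and the example interrupt numbers are hypothetical; the real table format is defined by the R-IN32M3 documentation [2]. The sketch only shows the idea of binding an interrupt source to a preregistered service at build time.

```c
#include <stddef.h>
#include <stdint.h>

/* Services that an interrupt can trigger without a software ISR. */
typedef enum {
    HWISR_SET_FLAG,
    HWISR_SIG_SEM,
    HWISR_WAKE_TASK
} hwisr_action_t;

typedef struct {
    uint8_t        irq;        /* interrupt source number */
    hwisr_action_t action;     /* service performed on that interrupt */
    uint8_t        object_id;  /* flag, semaphore, or task ID */
} hwisr_entry_t;

/* Table prepared at build time; the entries are made-up examples. */
static const hwisr_entry_t hwisr_table[] = {
    { 12, HWISR_SIG_SEM,   3 },
    { 20, HWISR_SET_FLAG,  1 },
    { 27, HWISR_WAKE_TASK, 5 },
};

/* Look up the entry for an interrupt. NULL means no automatic service
 * is registered, so a software ISR would be needed. */
const hwisr_entry_t *hwisr_lookup(uint8_t irq)
{
    for (size_t i = 0; i < sizeof hwisr_table / sizeof hwisr_table[0]; i++) {
        if (hwisr_table[i].irq == irq) {
            return &hwisr_table[i];
        }
    }
    return NULL;
}
```

In this model, interrupt 12 signals semaphore 3 with no ISR code in the path, which corresponds to the HW-RTOS case in Figure 7.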
[Figure 8 diagram: timeline across the hardware, CPU, RTOS, and user task layers. An interrupt from
a peripheral is processed first by the HW-ISR, including system call processing (for example Set
flag, Sig Sem, Rel Wai, Wup Tsk). The HW-ISR is processed before the CPU ISR, and an interrupt is
passed on to an ISR only if an ISR exists. Task A is uninterrupted while the HW-ISR runs. A context
switch (plus the run of the SW-ISR) takes place only if needed: with no SW-ISR, Task B runs only if
the HW-ISR causes a context switch; otherwise Task A may continue unless an ISR system call causes
a context switch.]
Figure 8. HW- and SW-ISR processing in greater detail with time on horizontal axis.
Mutex
HW-RTOS does not protect against deadlock or priority inversion for mutexes. Priority inversion
occurs when a high priority task is waiting for a suspended low priority task that occupies a
resource. Since HW-RTOS does not support priority inheritance, this must be added by user software.
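One way user software could add priority inheritance is sketched below. It is a bookkeeping model only, under stated assumptions: the types and function names are hypothetical, it handles a single mutex with no nesting, and in a real system the priority change would go through whatever change-priority service the RTOS provides rather than a direct field write.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t base_priority;     /* priority assigned at creation */
    uint8_t current_priority;  /* may be raised temporarily; lower = higher */
} pi_task_t;

typedef struct {
    pi_task_t *owner;          /* NULL when the mutex is free */
} pi_mutex_t;

/* Try to lock; returns 1 on success. On failure the owner inherits the
 * caller's priority if the caller is more urgent. */
int pi_mutex_try_lock(pi_mutex_t *m, pi_task_t *caller)
{
    if (m->owner == NULL) {
        m->owner = caller;
        return 1;
    }
    if (caller->current_priority < m->owner->current_priority) {
        m->owner->current_priority = caller->current_priority;
    }
    return 0;   /* the caller would now block in a real system */
}

/* Unlock and drop any inherited priority. */
void pi_mutex_unlock(pi_mutex_t *m)
{
    if (m->owner != NULL) {
        m->owner->current_priority = m->owner->base_priority;
        m->owner = NULL;
    }
}
```

Raising the owner to the waiter's priority keeps medium priority tasks from preempting it, so the resource is released sooner and the high priority task is delayed for a shorter, bounded time.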
Other features
Here are some other features worth mentioning:
• Release a task from waiting, wake up a task, cancel a wakeup, and put the calling task to
sleep with a timeout option.
• Task delay argument is 32 bits; 1 ms to 1100 hours.
• Ethernet MAC with built-in DMAC.
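The range quoted for the 32-bit task delay argument in the list above can be sanity-checked with simple arithmetic. The sketch assumes the delay is counted in 1 ms units and that the full 32-bit range is usable; the function name is illustrative.

```c
#include <stdint.h>

/* Longest delay, in whole hours, that fits in a 32-bit count of
 * millisecond units (assumes a 1 ms unit and the full 32-bit range). */
uint32_t max_delay_hours_32bit_ms(void)
{
    const uint64_t max_ms = UINT32_MAX;           /* 4,294,967,295 ms */
    const uint64_t ms_per_hour = 3600u * 1000u;   /* 3,600,000 ms */

    return (uint32_t)(max_ms / ms_per_hour);
}
```

Under these assumptions the upper bound is about 1193 hours, which is of the same order as the roughly 1100 hours quoted; the exact usable maximum on the device is not stated in this paper.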
Features not available
• Priority inheritance (to counter priority inversion) or the priority ceiling protocol of
uITRON. This would need to be done by software.
• Deadlock detection/avoidance on non-mutex resources. However, you can break out of
a deadlock with a timeout.
• Stack overflow/underflow surveillance in hardware and software.
IV. Performance Test
A performance test was done using a Tessera R-IN32M3-EC evaluation board, connected to a
Windows 7 64-bit machine.
HW-RTOS vs. off-the-shelf SW-RTOS
The author ran some benchmarks comparing HW-RTOS with Micrium’s SW-RTOS uC/OS-III, ported
for the R-IN32M3. Note that the uC/OS-III RTOS used did not use the HW-RTOS block. (Such a
port has since been developed.)
The tests were run using an R-IN32M3-EC board and the IAR toolchain (ARM 6.70). The system clock,
which was 100 MHz in the studied system, was used as the OS reference timer. Only semaphores and
event flags were tested, with and without context switching (preemption) for each respective call.
The author found that HW-RTOS’s main benefit on the R-IN32M3 is for applications that have heavy
task switching. That is, the user software processes are often swapped in and out. This is common
e.g. for motor control systems. HW-RTOS showed task switching operation to be over twice as fast
for most calls. Without a context switch (just a system call, then proceeding with the same task),
the speed of HW-RTOS did not change much. Here, the SW RTOS did much better and was in fact faster
than HW-RTOS. For the SW RTOS, the measured time with a context switch was around 5 to 8 times
the time for the same SW RTOS call without a context switch.
The following is a list of situations where the author saw noteworthy benefits in the number of
microseconds it took to switch tasks. In all these cases, a task switch occurs.
1. A task calls the RTOS, and there is another task waiting that has a higher priority.
2. A task is released from waiting (pending) on the RTOS to release a resource (e.g. flag,
semaphore) that is not available at the moment. This was the case both for a resource released
via another task, and (even more so) when a resource is released via a call from an interrupt.
3. A task has previously called HW-RTOS asking to be awoken at a specific time.
OS-calls that did not result in a task switch did not result in any improvement. These were actually
slower.
Clocks @ 100 MHz. (In the source figure, green marks the faster value.)
Category                   Scenario        Test type                                        HW-RTOS            Micrium
Start/Create task                                                                           175                268
Semaphore                  Create                                                           0 stat, 188 dyn.   86
Semaphore                  Signal (Post)   No context switch                                137                74
Semaphore                  Signal (Post)   With task switch                                 168                497
Semaphore                  Wait (Pend)     Non-block                                        156                76
Semaphore                  Wait (Pend)     With task switch                                 201                399
Event Flag                 Create                                                           0 stat, 190 dyn.   79
Event Flag                 Set (Post)      No context switch                                149                79
Event Flag                 Set (Post)      With task switch                                 191                529
Event Flag                 Wait (Pend)     Non-block                                        183                119
Event Flag                 Wait (Pend)     With task switch                                 202                480
Interrupt context switch   Preempt         End of ISR to task waiting for flag resume       147                186 (823*)
Interrupt context switch   Preempt         End of ISR to task waiting for semaphore resume  142                (Not measured)
Figure 9. Measurements made by the author comparing HW-RTOS with a traditional SW-RTOS*
using the R-IN32M3-EC board.
*Note that since this paper was written, the Micrium uC/OS-III HW-RTOS has been developed for the R-IN32M3.
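The clock counts in Figure 9 can be turned into times and speed ratios with two small helpers. The function names are illustrative, and the example figures in the note after the code are the semaphore signal measurements from Figure 9 (168 and 137 clocks for HW-RTOS, 497 and 74 clocks for the SW-RTOS); ratios are truncated to two decimals by the integer arithmetic.

```c
#include <stdint.h>

/* Convert a clock count to nanoseconds at a given CPU frequency.
 * cpu_hz must be nonzero. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000u) / cpu_hz);
}

/* Speedup of HW-RTOS over SW-RTOS in hundredths (295 means 2.95x).
 * Values below 100 mean the SW-RTOS was faster. hw_cycles must be nonzero. */
uint32_t speedup_x100(uint32_t sw_cycles, uint32_t hw_cycles)
{
    return (sw_cycles * 100u) / hw_cycles;
}
```

At 100 MHz, a semaphore signal with a task switch takes 1680 ns on HW-RTOS against 4970 ns on the SW-RTOS, a ratio of about 2.95. Without a context switch the ratio is about 0.54 (74 against 137 clocks), which is the case where the SW-RTOS is faster.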
Memory footprint
The writer found when testing that HW-RTOS used around 25% less RAM and around 15% less flash
than the SW-RTOS used for comparison (uC/OS-III). The larger memory needs of a SW-RTOS typically
consist of the tasks’ stacks and the space required to store data structures and the actual RTOS
program code.
Jitter
Jitter is, as noted earlier, a variation in task execution over time. Lowering jitter will reduce
the risk of varying or unexpected behavior. That is, it will result in more deterministic performance.
Improvement in jitter was not measured by the author as this requires a larger project using the
RTOS. An internal study from Renesas Japan roughly estimated a 20% to 80% reduction in jitter
compared with a software RTOS implementation on a different MCU. This
showed that the HW-RTOS has a much more consistent (stable) execution period.
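For readers who want to quantify jitter in their own project, one simple measure is the peak-to-peak spread of repeated timing measurements of the same operation. The sketch below assumes the samples are execution times in clocks collected by the user; the function name is illustrative, and this is not the method used in the internal study mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Peak-to-peak jitter: the difference between the longest and shortest
 * observed execution time. Returns 0 for an empty sample set. */
uint32_t jitter_peak_to_peak(const uint32_t *samples, size_t count)
{
    if (count == 0) {
        return 0;
    }
    uint32_t min = samples[0];
    uint32_t max = samples[0];

    for (size_t i = 1; i < count; i++) {
        if (samples[i] < min) {
            min = samples[i];
        }
        if (samples[i] > max) {
            max = samples[i];
        }
    }
    return max - min;
}
```

Comparing this value for the same system call under light and heavy system load gives a first indication of how strongly execution time depends on system state.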
[Figure 10 charts: four bar charts of OS processing time in µs (axis 0 to 8) per system call,
comparing R-IN32M3 with SW-RTOS. Panels: OS processing time (event flag), OS processing time
(task mng), OS processing time (semaphore), OS processing time (mailbox). System calls shown
include set_flg, clr_flg, wai_flg, sta_tsk (act_tsk), ter_tsk, ext_tsk, sig_sem, wai_sem, pol_sem,
snd_mbx, and rcv_mbx.]
Figure 10: Chart shows a comparison between the R-IN32M3 using HW-RTOS, and a comparable
MCU running at 100MHz, using a SW-RTOS.
V. Summary
In this paper we analyzed the features and performance improvements of the “Real-Time
OS Accelerator” (HW-RTOS) hardware IP on the Renesas R-IN32M3 industrial networking ASSP.
Benchmarks showed that task switching could execute up to 3x faster than typical SW-RTOS
operation at the same CPU clock speed, and with significantly less jitter. Compared with typical
software RTOS operation, which is basically sequential, the HW-RTOS is closely tied to the CPU and
allows interrupts to be handled without interrupting the current task, and its performance does
not depend on how much task switching takes place.
By simply making the system calls through familiar RTOS environments such as uITRON or uC/OS-III
HW-RTOS, one can easily manage multiple tasks while having the hardware IP do the heavy work of
resource management and prioritization.
So the R-IN32M3 HW-RTOS does help to improve RTOS operation, and the benefit is most evident
in industrial networking applications, which matches the R-IN32M3 target. On
the other hand, having an accelerator for RTOS in silicon would in fact benefit a wider range of
applications.
References
1. “2013 Embedded Market Study”, UBM Tech Electronics, April 2013.
2. “R-IN32M3 Series Programming Manual (OS edition)”, document no. r18uz0011ej0300_rin32m3.
3. “Hardware Real-Time Operating System for FPGA based embedded systems”, by Anders
Blaabjerg Lange, June 2011.
4. uITRON specification 4.0,
http://www.t-engine.org/wp-content/themes/wp.vicuna/pdf/specifications/en_US/WG024S001-04.03.00_en.pdf
5. Micrium, “Hardware-Accelerated RTOS: µC/OS-III HW-RTOS and the R-IN32M3”.