# Performance Counters Results and clocks / operation

Level 3
Hello,
I found the Perf_Counters example in the AURIX GitHub examples.
Out of curiosity, I ran some basic operations and then checked the performance counter values:
float division = 71 clocks
float multiplication = 67 clocks
uint division = 68 clocks
uint multiplication = 66 clocks
Do these results also depend on the compiler, or only on the processor?
I am quite surprised that the float operations take almost as long as the integer ones; this would mean that float could be used for precision calculations without problems.
Is there something I am missing?
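
The four readings above sit very close together. A minimal sketch in C that only restates the posted numbers and computes how far each one lies above the smallest reading; the names `readings`, `min_clocks` and `extra_clocks` are made up for illustration. One possible interpretation, not confirmed anywhere in this thread, is that a fixed cost shared by all four measurements (counter reads, operand load and store) accounts for most of each value.

```c
#include <stddef.h>

/* Readings quoted in the post above, in clocks. */
struct reading {
    const char *name;
    unsigned clocks;
};

static const struct reading readings[] = {
    { "float division",       71u },
    { "float multiplication", 67u },
    { "uint division",        68u },
    { "uint multiplication",  66u },
};

enum { NUM_READINGS = sizeof readings / sizeof readings[0] };

/* Smallest of the four readings, used as the reference point. */
static unsigned min_clocks(void)
{
    unsigned m = readings[0].clocks;
    for (size_t i = 1; i < NUM_READINGS; ++i) {
        if (readings[i].clocks < m) {
            m = readings[i].clocks;
        }
    }
    return m;
}

/* Extra clocks of reading i relative to the smallest reading. */
static unsigned extra_clocks(size_t i)
{
    return readings[i].clocks - min_clocks();
}
```

With these numbers, `extra_clocks` yields 5, 1, 2 and 0 for the four operations in the order posted, so the differences between operations are small compared with the roughly 66 clocks they all share.
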
7 Replies

## Re: Performance Counters Results and clocks / operation

Employee
Did you generate a list file and look at the instructions being generated?

Is the compiler set up to use the native FPU instructions?

## Re: Performance Counters Results and clocks / operation

Level 3
Just used the Perf Counters example

## Re: Performance Counters Results and clocks / operation

Level 6
In general there is no penalty for using floating point instructions. For many applications, it might even work out faster, because many floating point instructions can perform two operations in the same cycle.

## Re: Performance Counters Results and clocks / operation

Level 3
x = x * 3.156789f;
000000008000004e: movh.a a15,#0x6000
0000000080000052: lea a15,[a15]0x0
0000000080000056: movh.a a2,#0x6000
000000008000005a: lea a2,[a2]0x0
000000008000005e: ld.w d15,[a2]
0000000080000060: mov d0,#0x8d5
0000000080000064: addih d0,d0,#0x404a
0000000080000068: mul.f d15,d15,d0
000000008000006c: st.w [a15],d15

These are the instructions generated for the multiplication. x is declared as float.
Are these the right instructions?

5.5 Summary of functional changes from TC1.3.1

The TC1.6P and TC1.6E CPUs utilise different pipeline organisations than that used in
the TC1.3. One effect of the new pipeline organisation is to increase the load-use penalty
to 1 from 0. This necessitates re-scheduling of code to achieve optimum performance.
Other significant adaptations to the existing TC1.3.1 CPU are as follows:
• Fully Pipelined Floating Point Unit (FPU)
– Most floating point instructions now have a repeat rate of 1
• Improved debug system - now decoupled from protection system.
– 8 comparators providing up to 4 ranges, selectable for PC or load-store address
• Expanded and enhanced memory protection unit (MPU)
– 16 data ranges and 8 code ranges.
• New Temporal protection system.
– Guards against task runtime overrun.
• New Safety protection system. Tasks identified as safe by new PSW bit (PSW.S)
• New instructions for improved Interrupt and Data Cache manipulation support.
– DISABLE, RESTORE, CACHEI.I
• New instructions for Fast Integer Divide
– DIV, DIV.U
• New Instructions for fast call and return with minimal saving of state.
– FCALL, FCALLA, FCALLI, FRET
• Long offset addressing mode introduced for byte, half word and address accesses.
– LD.BU, LD.B, LD.HU, LD.H, ST.B, ST.H, ST.A
• Extended range of 16 bit jumps
– JEQ, JNE
• New Synchronisation Instructions
– CMPSWAP.W, SWAPMSK.W
• New CRC instruction
– CRC32
• New wait for interrupt instruction
– WAIT
• Increased flexibility in the system address map.
• Full SECDED ECC protection for all scratch, cache and tag memory structures.
• Cache and Scratchpad memory systems now entirely separated.
• Selectable interrupt vector table size (32bytes/entry, 8bytes/entry).

## Re: Performance Counters Results and clocks / operation

Employee
Sorry, I don't understand the issue.

Are you looking for an explanation of the assembly code?

    x = x * 3.156789f;
    000000008000004e: movh.a a15,#0x6000
    0000000080000052: lea a15,[a15]0x0       ; load the address of the variable x to store the result of the float operation in DSPR CPU1
    0000000080000056: movh.a a2,#0x6000
    000000008000005a: lea a2,[a2]0x0         ; load the address of the variable x for the operand in the float, located in DSPR CPU1
    000000008000005e: ld.w d15,[a2]          ; load the data pointed to by the address in A2
    0000000080000060: mov d0,#0x8d5
    0000000080000064: addih d0,d0,#0x404a    ; load your const float variable in d0 to be used as an operand
    0000000080000068: mul.f d15,d15,d0       ; perform a float operation between operands d15, and d0 and the result is in d15
    000000008000006c: st.w [a15],d15         ; store the result of the float operation at location pointed to by A15

I am not sure of the optimization level but you could save the extra address register load since you are using x for both an operand and result:
    movh.a a15,#0x6000
    lea a15,[a15]0x0
    ld.w d15,[a15]
    mov d0,#0x8d5
    addih d0,d0,#0x404a
    mul.f d15,d15,d0
    st.w [a15],d15
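
The `mov d0,#0x8d5` / `addih d0,d0,#0x404a` pair in the listing builds the 32-bit pattern of the float constant in two halves. A minimal sketch in C that reproduces this on a host machine, assuming the host `float` is IEEE-754 single precision; the helper names `tricore_mov`, `tricore_addih` and `bits_to_float` are hypothetical and only model the two instructions as described in the comments.

```c
#include <stdint.h>
#include <string.h>

/* Models mov dc,#const16 (32-bit form): the 16-bit constant is sign-extended. */
static uint32_t tricore_mov(uint16_t const16)
{
    return (uint32_t)(int32_t)(int16_t)const16;
}

/* Models addih dc,da,#const16: adds the constant shifted left by 16 bits. */
static uint32_t tricore_addih(uint32_t da, uint16_t const16)
{
    return da + ((uint32_t)const16 << 16);
}

/* Reinterprets a 32-bit pattern as a single-precision float. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Value left in d0 after the mov/addih pair from the listing. */
static uint32_t constant_bits(void)
{
    return tricore_addih(tricore_mov(0x08d5u), 0x404au);
}
```

`constant_bits()` evaluates to `0x404a08d5`, and `bits_to_float(0x404a08d5)` compares equal to `3.156789f`, which matches the constant in the C source line.
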

## Re: Performance Counters Results and clocks / operation

Level 3
Hi cwunder, thanks for the explanation!
I posted the assembly to make sure that the float instructions are being used.
Yesterday I studied part of the TC277 user manual (5000 pages!) and found this interesting information:

The TC1.6P and TC1.6E CPUs utilise different pipeline organisations than that used in
the TC1.3. One effect of the new pipeline organisation is to increase the load-use penalty
to 1 from 0. This necessitates re-scheduling of code to achieve optimum performance.
Other significant adaptations to the existing TC1.3.1 CPU are as follows:
• Fully Pipelined Floating Point Unit (FPU)
– Most floating point instructions now have a repeat rate of 1

5.9.3 Floating Point Pipeline Timing
These instructions are only valid if the optional Floating Point Unit is implemented.
Each instruction is single issued.
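Table 5-38 Floating Point Instruction Timing

| Floating Point Instruction | Result Latency TC1.6P | Result Latency TC1.6E | Repeat Rate TC1.6P | Repeat Rate TC1.6E |
|---|---|---|---|---|
| ADD.F | 2 | 2 | 1 | 1 |
| CMP.F | 1 | 1 | 1 | 1 |
| DIV.F | 8 | 7 | 6 | 6 |
| FTOI | 2 | 1 | 1 | 1 |
| FTOIZ | 2 | 1 | 1 | 1 |
| FTOQ31 | 2 | 1 | 1 | 1 |
| FTOQ31Z | 2 | 1 | 1 | 1 |
| FTOU | 2 | 1 | 1 | 1 |
| FTOUZ | 2 | 1 | 1 | 1 |
| ITOF | 2 | 1 | 1 | 1 |
| MADD.F | 3 | 2 | 1 | 1 |
| MSUB.F | 3 | 2 | 1 | 1 |
| MUL.F | 2 | 2 | 1 | 1 |
| Q31TOF | 2 | 1 | 1 | 1 |
| QSEED.F | 1 | 1 | 1 | 1 |
| SUB.F | 2 | 2 | 1 | 1 |
| UPDFL | – | – | 1 | 1 |
| UTOF | 2 | 1 | 1 | 1 |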

So DIV.F has 8, 7 or 6, 6; I don't know which applies to the TC277D.
In comparison, DIV.U has 4-11, 3-10 or 3-9, 3-9.

## Re: Performance Counters Results and clocks / operation

Level 6
On a TC27x, CPU0 is 1.6E, and CPU1/2 are 1.6P. See the TC27x Block Diagram in the User Manual.
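
Combining this core mapping with the quoted timing table gives a per-CPU figure. The sketch below is in C and rests on two assumptions: the table columns are read as result latency (TC1.6P, TC1.6E) followed by repeat rate (TC1.6P, TC1.6E), and a run of independent back-to-back instructions is modelled in the idealized way, with the first result after `latency` cycles and each further one `repeat` cycles later. It ignores loads, stores, dependencies and stalls, so it is not a prediction of what the performance counters will show. All identifiers are hypothetical.

```c
/* Result latency and repeat rate, in cycles, as quoted from Table 5-38. */
struct fpu_timing {
    unsigned latency;
    unsigned repeat;
};

enum core { CORE_TC16E, CORE_TC16P };

/* Values for two instructions, indexed by core type. */
static const struct fpu_timing MUL_F[] = {
    [CORE_TC16E] = { 2u, 1u },
    [CORE_TC16P] = { 2u, 1u },
};
static const struct fpu_timing DIV_F[] = {
    [CORE_TC16E] = { 7u, 6u },
    [CORE_TC16P] = { 8u, 6u },
};

/* TC27x mapping from the reply above: CPU0 is 1.6E, CPU1 and CPU2 are 1.6P. */
static enum core tc27x_core(unsigned cpu_index)
{
    return cpu_index == 0u ? CORE_TC16E : CORE_TC16P;
}

/* Idealized cycle count for n independent back-to-back instructions. */
static unsigned ideal_cycles(struct fpu_timing t, unsigned n)
{
    return n == 0u ? 0u : t.latency + (n - 1u) * t.repeat;
}
```

Under these assumptions a single DIV.F is modelled as 7 cycles on CPU0 and 8 cycles on CPU1 or CPU2, while four independent DIV.F instructions on CPU1 come to 8 + 3 * 6 = 26 cycles.
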