Floating Point Performance too weak?


cross mob
Not applicable
Hello everyone,

For profiling I usually use a rudimentary procedure like this:

At the beginning of the code section whose performance is to be evaluated, a GPO is toggled, and at the end it is toggled again. The time between the two toggles is measured with a suitable oscilloscope.

If I do this for the XMC4500, I get values that are far worse than I would expect. I was not sure how much delay toggling a GPO adds, so I first reduced the code to be measured to nothing (as shown below). Then I added a single float32 multiplication, measured again, added another multiplication, measured again, and so on. The interrupt routine has the highest priority of all periodically executed ISRs defined in my system, but to be sure the measurement is not interrupted I treat it as a critical section (the current interrupt temporarily gets the highest possible priority).
As the excerpts from the .ld and .map files show, the selected ISR is executed from PSRAM:

/*=== IRAMCode section for PSRAM code ============*/
sIRAMCodeLoad = eROData + __Xmc4500_Data_Size;
IRAM_Code : AT (sIRAMCodeLoad)
{
sIRAMCode = ABSOLUTE(.);
* (.IRAMCode);
. = ALIGN(4);
eIRAMCode = ABSOLUTE(.);
} > PSRAM_1
IRAMCodeSize = eIRAMCode - ORIGIN(PSRAM_1);

.IRAMCode 0x100006e8 0x688 APP/BU_Ictrl.o
0x100006e8 VADC0_G0_2_IRQHandler

So in the end the code looks like this:

void VADC0_G0_2_IRQHandler (void) __attribute__((section(".IRAMCode")));

void VADC0_G0_2_IRQHandler (void) {

    uint32_t ISR_Prio;

    ISR_Prio = PPB->NVIC_IPR5;                  // Store old ISR priority
    PPB->NVIC_IPR5 = 0xfcfcfc00;                // Set priority to 0 for VADC0_G0_2_IRQHandler
    TOGGLE_GPO(PORT0->OMR, PORT0_OMR_PS9_Pos);  // Toggle GPO P0.9 (start of measurement)

    //I1_Err=BU_Ictrl_I1_PI_Reg.Err*163.83;     // Activate for measurements 2 to 7
    //I1_Ref=BU_Ictrl_I1_PI_Reg.Ref*163.83;     // Activate for measurements 3 to 7
    //I1_ITerm=BU_Ictrl_I1_PI_Reg.Ui*16383.0;   // Activate for measurements 4 to 7
    //I1_PTerm=BU_Ictrl_I1_PI_Reg.Up*16383.0;   // Activate for measurements 5 to 7
    //I1_Out=BU_Ictrl_I1_PI_Reg.Out*16383.0;    // Activate for measurements 6 to 7
    //ADC_I1=BU_Ictrl_I1_lpf*163.83;            // Activate for measurement 7

    TOGGLE_GPO(PORT0->OMR, PORT0_OMR_PS9_Pos);  // Toggle GPO P0.9 (end of measurement)
    PPB->NVIC_IPR5 = ISR_Prio;                  // Restore old priority for VADC0_G0_2_IRQHandler
}

The results, measured with a Tektronix MSO 4054:
1. 50ns
2. 950ns
3. 1900ns
4. 2900ns
5. 3900ns
6. 4800ns
7. 5800ns

I had to vary the time scale of the oscilloscope, which of course affects the measuring accuracy; that is why the time increase is not perfectly linear although identical instructions are added. Nevertheless, I expected that adding a simple multiplication would increase the time by a few cycles (8.33 ns each), not by almost 1 µs (more than 100 cycles for a multiplication?).
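As a rough check of these numbers, the measured times can be converted to CPU cycles. The sketch below assumes a 120 MHz core clock, which is what the 8.33 ns cycle time implies; the function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed core clock: 120 MHz, matching the 8.33 ns cycle time quoted above. */
#define CORE_CLOCK_MHZ 120u

/* Convert a time span in nanoseconds to whole CPU cycles (rounded down). */
uint32_t ns_to_cycles(uint32_t ns)
{
    return (ns * CORE_CLOCK_MHZ) / 1000u;
}

/* Cycles spent between the two toggles after subtracting the
   empty-ISR baseline (measurement 1, the toggle overhead itself). */
uint32_t net_cycles(uint32_t measured_ns, uint32_t baseline_ns)
{
    return ns_to_cycles(measured_ns - baseline_ns);
}
```

With the values from the list, net_cycles(950, 50) gives 108 cycles for the first multiplication, and net_cycles(5800, 50) gives 690 cycles for all six, i.e. about 115 cycles each.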

As seen below, the options of the ARM-GCC C compiler indicate that the hardware FPU is activated and the softfp floating-point ABI is used (-mfloat-abi=softfp). If I deactivate the FPU by setting the option -mfloat-abi=soft, the time increase is more than twice as big. Still, I would have expected deactivating the hardware FPU to reduce performance by more than a factor of 2.

-DDAVE_CE -DUC_ID=4502 -D__FPU_PRESENT -DARM_MATH_CM4 -I"D:\DAVE-3.1.8\eclipse\/../CMSIS/Include" -I"D:\DAVE-3.1.8\eclipse\/../CMSIS/Infineon/Include" -I"D:\DAVE-3.1.8\ARM-GCC/arm-none-eabi/include" -I"D:\DAVE-3.1.8\eclipse\/../emWin/Start/GUI/inc" -I"D:\DAVE-3.1.8\eclipse\/../CMSIS/Infineon/XMC4500_series/Include" -I"D:\progXMC\blx_xmc\Dave\Generated\inc\MOTORLIBS" -I"D:\progXMC\blx_xmc\Dave\Generated\inc\LIBS" -I"D:\progXMC\blx_xmc\Dave\Generated\inc\DAVESupport" -O3 -ffunction-sections -Wall -std=gnu99 -mfloat-abi=softfp -Wa,-adhlns="$@.lst" -c -fmessage-length=0 -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mthumb

I am quite sure that my measuring procedure delivers reasonable results, because if I add a few more multiplications to the VADC0_G0_2_IRQHandler interrupt (executed at 25 kHz), I get problems with the RS232 connection to the target, obviously because most of the available time is consumed by this very interrupt.

In conclusion, I am pretty perplexed: what can I do to improve performance, or is this performance already the best I can get? Maybe I can't see the forest for the trees, my whole approach is faulty, or I am missing an important point?

Thanks for any helpful suggestions.
gwang
Employee
Dear Sir,

I think the floating-point instruction "vmul.f32" is not used in your test. Please check your assembly code to see whether vmul.f32 appears. To make sure the compiler uses vmul.f32, I suggest stating the data type of the constant explicitly, like this:

I1_Err=BU_Ictrl_I1_PI_Reg.Err* (float)163.83; //Activate for 2. to 7. Measurement
I1_Ref=BU_Ictrl_I1_PI_Reg.Ref* (float)163.83; //Activate for 3. to 7. Measurement

I ran a test with a single float multiplication and got 101 ns; for two float multiplications I got 200 ns. So each additional multiplication adds only about 100 ns.
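A minimal sketch of why the constant's type matters, assuming a standard C compiler: an unsuffixed constant such as 163.83 has type double, so the float operand is converted and the product is computed in double precision. The FPU selected with -mfpu=fpv4-sp-d16 is single precision only, so such a product is handled by software routines instead of vmul.f32. The function names here are made up for illustration.

```c
#include <stddef.h>

/* sizeof does not evaluate its operand; it only reports the size of the
   expression's type, which reveals whether the product is float or double. */

/* 163.83 is a double constant, so x is promoted and the product is a double. */
size_t product_size_unsuffixed(float x)
{
    return sizeof(x * 163.83);
}

/* The cast (or an f suffix) keeps both operands float, so the product is a float. */
size_t product_size_float(float x)
{
    return sizeof(x * (float)163.83);
}
```

On a typical toolchain the first function returns 8 and the second returns 4.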

For more information you can send the email directly to me (guangyu.wang@infineon.com).
Not applicable
Dear gwang,

What is your favorite beer brand? If I ever meet you, I'll pay for a round. 😉 I feel a little ashamed, but you're absolutely right: this simple thing solved everything. I was so sure that the compiler always interprets a constant like 123.45 as float that I was blind to this. By trial and error I was slowly getting onto the same track, namely that only the multiplications with constants are slow, but you definitely saved me a lot of time. Big thanks for that.

If it is OK, I will use your contact information if I keep having problems with including the exp() function (a section in the .map file overlaps as soon as the corresponding math libs are pulled in).

Best regards.
User6412
Level 4
Just add this to the compiler settings (under Miscellaneous/Other Flags): -Wdouble-promotion
and you will see warnings about implicit ("hidden") use of double precision, like this: "warning: implicit conversion from 'float32_t' to 'double' to match other operand of binary expression"
Instead of "(float)163.83" you can write 163.83f.
The CMSIS/Include/arm_math.h library has this problem too, so be careful.
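One way to keep such constants from slipping back to double is to define them once with the f suffix and reuse them. This is only a sketch: SCALE_A, SCALE_B and the two helper functions are hypothetical names, and only the values 163.83 and 16383.0 come from the ISR in the first post.

```c
/* Hypothetical names for the two scale factors used in the ISR.
   The f suffix gives each constant type float instead of double. */
#define SCALE_A 163.83f
#define SCALE_B 16383.0f

/* Both operands are float, so the multiplication stays in single precision. */
float scale_a(float value)
{
    return value * SCALE_A;
}

float scale_b(float value)
{
    return value * SCALE_B;
}
```

Compiling with -Wdouble-promotion should then stay quiet for these expressions, while a forgotten suffix elsewhere would still be reported.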