
AURIX™

User19083
Level 1
While reading the documentation for the LMU of the TC297, I came across the manual section "LMU SRAM Read Buffers". These buffers are meant to speed up sequential 32-bit reads.

Since the buffers are not enabled by default (reset value is LMU_BUFCONx.ENy = 0, i.e. disabled, with x=0,1,2 and y=1,2), I'm trying to enable them in the startup sequence of my software, after first setting the TAG ID of the SRI master I want to read from.

Before doing that, I wanted to measure the performance gain for sequential LMU accesses using the performance counter registers. My conclusion so far is that there is no gain after (my attempt at) enabling the LMU read buffers 😕

Below is the code snippet I'm using for the check (DCACHE is disabled):

      
for(int i=0; i<1000; ++i) {
    my_global_array[i] = (unsigned int)rand();
}

for(int i=0; i<2; ++i) {
    unlock_wdtcon();
    // Set up TAG1 of BUFCON0 for CPU0
    MODULE_LMU.BUFCON[0].B.TAG1 = 0b000000;
    MODULE_LMU.BUFCON[0].B.EN1 = i%2; // i=0: buffer disabled, i=1: buffer enabled

    // Make sure all others are disabled
    MODULE_LMU.BUFCON[0].B.EN2 = 0;

    MODULE_LMU.BUFCON[1].B.EN1 = 0;
    MODULE_LMU.BUFCON[1].B.EN2 = 0;

    MODULE_LMU.BUFCON[2].B.EN1 = 0;
    MODULE_LMU.BUFCON[2].B.EN2 = 0;
    lock_wdtcon();

    __mtcr(CPU_CCTRL, 0); // stop PMC counters
    __mtcr(CPU_CCNT, 0); // clear CCNT
    __mtcr(CPU_CCTRL, 2); // start PMC counters

    for(int i=0; i<1000; i++) {
        a = my_global_array[i];
        //a = (int)(*(unsigned int *)(0xb0000000 + i*4));
    }

    __mtcr(CPU_CCTRL, 0); // stop PMC counters

    ccnt_res[i] = __mfcr(CPU_CCNT); // one result per pass of the outer loop
}

gain_by_readbuf = (int)ccnt_res[0]-ccnt_res[1];

asm volatile ( "debug \t\n"::);


Here “my_global_array” is a 1000-element array located in LMU SRAM. (I also tried reading directly from the non-cached LMU address, see the commented-out line in the code above.)

I’m running this on core 0 with master TAG ID 0b000000 (the DMI.NonSafe TAG ID for CPU0). I also tried 0b000001 (the DMI.Safe TAG ID for CPU0). I don’t really know the difference between ‘Safe’ and ‘NonSafe’ for the DMI master; if anyone has info or ideas on that, I’d be happy to know :D

The gain over the 1000 loop iterations above is negligible (about 15 cycles), which doesn’t look like the effect of an enabled buffer.

In fact, if I force all LMU read buffers off, I get the same result. I was expecting ‘gain_by_readbuf’ to show a significant gain in cycles once the buffer is enabled.

I therefore conclude that the buffer is simply not being enabled in either case. I found no mention in the chip's errata of any issue with these (otherwise documented) read buffers, unless I missed it.

I wonder whether anybody has come across a similar issue, or can spot a setting that I'm overlooking in my setup.

Thanks!
NeMa_4793301
Level 6
Hi jjorba.

For TAG IDs: the idea with Safe/Non-Safe TAG IDs is that you can use the ACCENx registers to restrict which bus masters have access to various parts of the system. No impact to LMU performance.

Just to be sure: is "a" a volatile variable? If not, an optimizing compiler may remove the loop and just do the last assignment, a=my_global_array[999].
Are you sure that DCACHE is not enabled? Verify CPU0_DCON0.DCBYP=1.
Can you share the assembly code for your example?
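On the volatile point, here is a minimal sketch of how the loads could be forced regardless of optimization level. The helper name read_all_volatile is made up for illustration and is not from the original code; it only assumes standard C99.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post).
 * Reading through a pointer-to-volatile makes every load an observable
 * access, so the compiler must emit one 32-bit read per element even
 * when optimization is enabled. Returns the last value read. */
uint32_t read_all_volatile(const volatile uint32_t *buf, size_t count)
{
    assert(buf != NULL || count == 0);

    uint32_t last = 0;
    for (size_t i = 0; i < count; ++i) {
        last = buf[i]; /* one load per iteration, never folded away */
    }
    return last;
}
```

In the measurement above, this would be called as read_all_volatile(my_global_array, 1000) between starting and stopping the counters.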
User19083
Level 1
Hello,

Thanks for the reply and interest.

I have now explicitly checked DCBYP and it is 1 (the startup value), so I believe DCACHE is bypassed.

I also added a couple of lines to force it anyway, with the same result. "a" is not volatile, but I'm building with no optimization and with the -g debug setting.

The loop does appear to have been generated as written. The CCNT values are also consistent with 1000 iterations: around 41000 for both tests.

It seems the forum won't let me attach the listing as a file (invalid file), and it is too large to paste here (about 15k characters against a 10k limit), so I've put it here if you want to take a look: https://bit.ly/32R6vv6

I'm running it on a TriBoard with an emulation device for the TC297.
NeMa_4793301
Level 6
That's generating some very odd code. Here's what I got for the initial rand() loop:

.L21:
    movh.a a15,#@his(my_global_array)
    lea a15,[a15]@los(my_global_array)
.L35:
    lea a12,999
.L2:
    call rand
.L36:
    st.w [a15+],d2
    loop a12,.L2


vs. your assembly listing:

139:../src/shared_main.c **** for(int i=0; i<1000; ++i) {
179 .loc 1 140 0
180 00c0 820F mov %d15,0
181 00c2 59EFF8FF st.w [%a14]-8,%d15
182 00c6 3C14 j .L11
183 .L12:
184 .Lcontrol_flow_BBE13:
185 .Lcontrol_flow_BBB15:
140:../src/shared_main.c **** my_global_array[i] = (unsigned int)rand();
186 .loc 1 141 0 discriminator 3
187 00c8 6D000000 call rand
**** Notice:Expanding call-R -> call-O
188 00cc 022F mov %d15,%d2
189 00ce 02F2 mov %d2,%d15
190 00d0 7B0000F0 movh %d15,hi:my_global_array
191 00d4 1B0F0030 addi %d3,%d15,lo:my_global_array
192 00d8 19EFF8FF ld.w %d15,[%a14]-8
193 00dc 062F sh %d15,2
194 00de 423F add %d15,%d3
195 00e0 60F2 mov.a %a2,%d15
196 00e2 7422 st.w [%a2]0,%d2
**** Notice:Optimizing st.w-@wd -> st.w-@d
140:../src/shared_main.c **** my_global_array[i] = (unsigned int)rand();
197 .loc 1 140 0 discriminator 3
198 00e4 19EFF8FF ld.w %d15,[%a14]-8
199 00e8 C21F add %d15,1
200 00ea 59EFF8FF st.w [%a14]-8,%d15
201 .L11:
202 .Lcontrol_flow_BBE15:
203 .Lcontrol_flow_BBB14:
140:../src/shared_main.c **** my_global_array[i] = (unsigned int)rand();
204 .loc 1 140 0 is_stmt 0 discriminator 1
205 00ee 19EFF8FF ld.w %d15,[%a14]-8
206 00f2 3B803E20 mov %d2,1000
207 00f6 3F2FE97F jlt %d15,%d2,.L12


What compiler + settings are you using? How is my_global_array declared?
User19083
Level 1
Hello,

It's HighTec, --version is:

tricore-gcc (HighTec Release HDP-v4.9.3.0-infineon-1.0-fb21a99) 4.9.4 build on 2019-06-07

Settings for tricore-gcc are:

"C:\HighTec\toolchains\tricore\v4.9.3.0-infineon-1.0/bin/tricore-gcc" -c -gdwarf-2 -I/*..header folders..*/ -fno-common -O0 -g2 -fdwarf-control-flow -W -Wall -Wextra -Wdiv-by-zero -Warray-bounds -Wcast-align -Wignored-qualifiers -Wformat -Wformat-security -Wa,-ahlms=shared_main.lst -pipe -DTC29XB -D__TC29XX__ -D__TRICORE__ -D__TC161__ -DTRIBOARD_TC2X7_V1_0 -D__GNUC__=4 -fshort-double -mcpu=tc29xx -mversion-info -std=gnu99 -MMD -MP -MF"src/shared_main.d" -MT"src/shared_main.o" -o "src/shared_main.o" "../src/shared_main.c"

(I've omitted the multiple -I include folders.)

As for the declaration of my_global_array, as well as all the other variables:

uint32_t my_global_array[1000];

It is global, declared outside of all functions in the same .c file for which I shared the .lst (shared_main.c), with no further qualifiers. The same goes for all the other variables involved.
User16286
Level 4
Your core loop is 17 instructions.
Enabling cache saves one or two clocks on the single memory access that is in the loop.
So because your code is so inefficient, the cache only makes the code run about 4% faster.
You need to actually understand what the processor is doing.

Toshi
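As a rough check of that estimate, the arithmetic can be sketched as follows. The function name expected_gain_percent is hypothetical; the inputs are the figures quoted in this thread (1000 accesses, one or two clocks saved per access, roughly 41000 cycles in total).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original posts).
 * Estimated speed-up in percent if each of `accesses` loads becomes
 * `cycles_saved_per_access` cycles cheaper, relative to a run that takes
 * `total_cycles` cycles overall. */
double expected_gain_percent(uint32_t accesses,
                             uint32_t cycles_saved_per_access,
                             uint32_t total_cycles)
{
    assert(total_cycles > 0u);
    return 100.0 * (double)accesses * (double)cycles_saved_per_access
           / (double)total_cycles;
}
```

With 1000 accesses and about 41000 total cycles, one clock saved per access gives roughly 2.4% and two clocks roughly 4.9%, which brackets the "about 4%" figure above.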
User19083
Level 1
Hello Toshi,

Thanks for your reply.

The improvement I get is well below 1% (about 0.1%), and it doesn't appear to scale with the number of loop iterations either.

That is, running the original code, the improvement I get is 27 cycles (at f_cpu = 300 MHz): the execution time (CCNT) is 33040 cycles the first time and 33013 cycles the second time**.

I'd expect the absolute improvement to be some multiple or fraction of 1000, 1000 being the number of LMU accesses.
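To put the measured numbers in per-access terms, a small sketch (the helper name cycles_saved_per_access is made up for illustration; the first pass runs with the buffer disabled and the second with it enabled, as in the test code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original posts).
 * Average number of cycles saved per memory access, given the CCNT
 * reading with the read buffer disabled and with it enabled. */
double cycles_saved_per_access(uint32_t cycles_disabled,
                               uint32_t cycles_enabled,
                               uint32_t accesses)
{
    assert(accesses > 0u);
    return ((double)cycles_disabled - (double)cycles_enabled)
           / (double)accesses;
}
```

For 33040 and 33013 cycles over 1000 accesses this comes to 0.027 cycles per access, far below the one or more whole cycles per access that a working read buffer would be expected to save.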

I presume the current improvement is not related to enabling the LMU buffers but to something else (maybe PCACHE, as I'm running from PFLASH).

**In my original post I mentioned about 41K cycles for the CCNT reading; I'm now getting 33K. I may have been using different surrounding settings, or the code may have been slightly different. The result, little or no improvement, is the same though.


Below is a more concise version of the loop that shows the same problem: both the first and the second execution of the loop take exactly 13005 CCNT cycles in my setup, with no change or improvement.

The buffers don't seem to become enabled with this code.



for(int i=0; i<2; ++i) {
    unlock_wdtcon();
    // Set up TAG1 of BUFCON0 for CPU0
    MODULE_LMU.BUFCON[0].B.TAG1 = 0b000000;
    MODULE_LMU.BUFCON[0].B.EN1 = i%2; // i=0: buffer disabled, i=1: buffer enabled

    // Make sure all others are disabled
    MODULE_LMU.BUFCON[0].B.EN2 = 0;

    MODULE_LMU.BUFCON[1].B.EN1 = 0;
    MODULE_LMU.BUFCON[1].B.EN2 = 0;

    MODULE_LMU.BUFCON[2].B.EN1 = 0;
    MODULE_LMU.BUFCON[2].B.EN2 = 0;
    lock_wdtcon();

    // Load loop iterations in a3
    asm volatile (
        "mov %%d4,1000-1 \n\t"
        "mov.a %%a3,%%d4 \n\t"
        ::);

    uint32_t lmu_addr = 0xb0003000; // Some LMU addr / non-cached flavor

    // Prepare first (LMU) address to load in %%a6
    asm volatile (
        "ld.w %%d4, %[l_lmu_addr] \n\t"
        "mov.a %%a6, %%d4 \n\t"
        :: [l_lmu_addr] "m" (lmu_addr));

    // Prepare the 4-byte address increment for add.a in %%a7
    asm volatile (
        "mov %%d4, 4 \n\t"
        "mov.a %%a7, %%d4 \n\t"
        ::);

    __mtcr(CPU_CCTRL, 0); // stop counters
    __mtcr(CPU_CCNT, 0); // clear CCNT
    _isync();
    __mtcr(CPU_CCTRL, 2); // start counters

    asm volatile (
        ".lmu_buff_test_loop: \n\t"
        "ld.w %%d0, [%%a6] \n\t" /* LMU access */
        "add.a %%a6, %%a7 \n\t" /* update address, +4 */
        "loop %%a3, .lmu_buff_test_loop \n\t"
        ::);

    __mtcr(CPU_CCTRL, 0); // stop counters

    ccnt_res[i] = __mfcr(CPU_CCNT); // one result per pass of the outer loop
}

gain_by_readbuf = (int)ccnt_res[0]-ccnt_res[1];

NeMa_4793301
Level 6
Hi jjorba. Could it be that the Safe bit is set (PSW.S=1)? That will change the CPU tag ID. EDIT: I see you tried that earlier, so never mind.

Try setting LMU_BUFCONx.TAG1 and TAG2 to match the TAG assignments of the CPU DMI interfaces, per Table 2-15. There are two tag IDs per BUFCON, so each can support the non-safe ID (PSW.S=0) and the safe ID (PSW.S=1):

All together, like so:
LMU_BUFCON0.U = 0xC0000100;  // EN2=1, EN1=1, TAG2=1 (CPU0 when PSW.S=1), TAG1=0 (PSW.S=0)
LMU_BUFCON1.U = 0xC0000302; // EN2=1, EN1=1, TAG2=3 (CPU1 when PSW.S=1), TAG1=2 (PSW.S=0)
LMU_BUFCON2.U = 0xC0000504; // EN2=1, EN1=1, TAG2=5 (CPU2 when PSW.S=1), TAG1=4 (PSW.S=0)
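For reference, here is a sketch of how those three constants decompose into fields. The bit positions are inferred only from the example values and comments above (TAG1 in the low bits, TAG2 starting at bit 8, the two enable bits at 30 and 31); they are an assumption, not copied from the user manual, and which of bits 30/31 is EN1 cannot be told from these values. The function and macro names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED layout, inferred from 0xC0000100 / 0xC0000302 / 0xC0000504.
 * Check the LMU_BUFCONx register description before relying on it. */
#define BUFCON_TAG1_SHIFT 0u
#define BUFCON_TAG2_SHIFT 8u
#define BUFCON_EN1_BIT    (1u << 30) /* assumption: EN1 = bit 30 */
#define BUFCON_EN2_BIT    (1u << 31) /* assumption: EN2 = bit 31 */

/* DMI tag ID following the comments above:
 * CPU0 -> 0 (PSW.S=0) / 1 (PSW.S=1), CPU1 -> 2 / 3, CPU2 -> 4 / 5. */
uint32_t dmi_tag(uint32_t cpu, uint32_t psw_s)
{
    assert(cpu <= 2u && psw_s <= 1u);
    return 2u * cpu + psw_s;
}

/* Value enabling both buffers of BUFCONx for one CPU:
 * TAG1 = non-safe tag, TAG2 = safe tag. */
uint32_t bufcon_value(uint32_t cpu)
{
    return BUFCON_EN2_BIT | BUFCON_EN1_BIT
         | (dmi_tag(cpu, 1u) << BUFCON_TAG2_SHIFT)
         | (dmi_tag(cpu, 0u) << BUFCON_TAG1_SHIFT);
}
```

Under these assumptions, bufcon_value(0), bufcon_value(1) and bufcon_value(2) reproduce the three values listed above.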


One other idea - can you try segment 9 (cached) instead of segment B (non-cached)?

a = (int)(*(unsigned int *)(0x90000000 + i*4));
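A small sketch of the address change being suggested. It assumes, as this reply implies, that segment 9 and segment B are two views of the same LMU memory that differ only in the top address nibble; the helper name to_segment is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original posts).
 * Replaces the segment (top 4 address bits) and keeps the 28-bit offset,
 * e.g. to switch between the segment B and segment 9 views. */
uint32_t to_segment(uint32_t addr, uint32_t segment)
{
    assert(segment <= 0xFu);
    return (addr & 0x0FFFFFFFu) | (segment << 28);
}
```

For example, to_segment(0xB0003000, 0x9) gives 0x90003000, the segment 9 counterpart of the address used in the assembly test loop.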