Can't enable LMU/SRAM read buffers in TC29x

User19083 · ‎Mar 04, 2020

I was checking the documentation on the LMU of the TC297 and stumbled upon the manual section "LMU SRAM Read Buffers", which describes buffers that enable quicker reads for sequential 32-bit accesses.

As those buffers are not enabled by default (MODULE_LMU.BUFCON[x].B.ENy = 0 means disabled; x=0,1,2, y=1,2), I'm trying to enable them in the startup sequence of my software, after first setting up the SRI master TAG ID I want to read from.

I also wanted to see the performance gain in sequential LMU accesses, using the performance counter registers, and I have to conclude that there is none after (my attempt at) enabling the LMU read buffers 😕

Below is the code snippet I'm using for the check (DCACHE is disabled):

      
      for(int i=0; i<1000; ++i) {
            my_global_array[i] = (unsigned int)rand();
      }
 
      for(int i=0; i<2; ++i) {
            unlock_wdtcon();
                  // Setup tag1 of bufcon0 for cpu0
            MODULE_LMU.BUFCON[0].B.TAG1 = 0b000000;
            MODULE_LMU.BUFCON[0].B.EN1 = i%2;
 
                  // Make sure all others are disabled
            MODULE_LMU.BUFCON[0].B.EN2 = 0;
 
            MODULE_LMU.BUFCON[1].B.EN1 = 0;
            MODULE_LMU.BUFCON[1].B.EN2 = 0;
 
            MODULE_LMU.BUFCON[2].B.EN1 = 0;
            MODULE_LMU.BUFCON[2].B.EN2 = 0;
            lock_wdtcon();
 
            __mtcr(CPU_CCTRL, 0);   // stop PMC counters
            __mtcr(CPU_CCNT, 0);    // clear CCNT
            __mtcr(CPU_CCTRL, 2);   // start PMC counters
 
            for(int i=0; i<1000; i++) {
 
                  a = my_global_array[i];
                  //a = (int)(*(unsigned int *)(0xb0000000 + i*4));
            }
 
            __mtcr(CPU_CCTRL, 0);   // stop PMC counters
 
            ccnt_res[i] = __mfcr(CPU_CCNT);
      }
 
      gain_by_readbuf = (int)ccnt_res[0]-ccnt_res[1];
 
 
      asm volatile ( "debug \t\n"::);

Where “my_global_array” is a 1000-element int array in LMU SRAM (I also tried reading directly from a non-cached LMU address, commented out in the code above).

I’m running this on core 0 with master TAG ID 0b000000 (the DMI.NonSafe TAG ID for CPU0). I also tried 0b000001 (the DMI.Safe TAG ID for CPU0). I don’t really know the difference between ‘Safe’ and ‘NonSafe’ for a DMI master (if anyone has info or ideas on that, I'd also be happy to know :D)

The performance gain over the 1000 iterations of the loop above is negligible (about 15 cycles); it doesn’t look like the effect of an enabled buffer.

In fact, if I force all LMU read buffers off, I get the same result. I was expecting ‘gain_by_readbuf’ to show a significant gain in cycles once the buffer is enabled.

Therefore, I conclude that the buffer is simply not being enabled in any case. I found no mention in the chip's errata of any issue with those (otherwise documented) read buffers.

I wonder if anybody has come across a similar issue or can spot any setting that I'm overlooking in my setup.

Thanks!

NeMa_4793301 · ‎Mar 04, 2020

Hi jjorba.

For TAG IDs: the idea with Safe/Non-Safe TAG IDs is that you can use the ACCENx registers to restrict which bus masters have access to various parts of the system. No impact to LMU performance.
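
To make the ACCENx idea concrete, here is a minimal sketch. It assumes the usual AURIX convention that bit n of an ACCEN0 register enables the master with TAG ID n; the helper name and the example mask are made up for illustration and are not taken from the iLLD headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit n of ACCEN0 enables the master with TAG ID n (0..31).
   Hypothetical helper, not an iLLD function. */
static bool accen0_allows(uint32_t accen0, unsigned tag_id)
{
    if (tag_id > 31u) {
        return false;   /* TAG IDs above 31 are not covered by ACCEN0 */
    }
    return ((accen0 >> tag_id) & 1u) != 0u;
}
```

With a mask of 0x00000002, only TAG ID 1 (CPU0 DMI.Safe) passes the check, while TAG ID 0 (CPU0 DMI.NonSafe) does not.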

Just to be sure: is ‘a’ a volatile variable? If not, the compiler will optimize the loop away and just do the last assignment of a=my_global_array[999].
Are you sure that DCACHE is not enabled? Verify CPU0_DCON0.DCBYP=1.
Can you share the assembly code for your example?
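
For the volatile question, here is a minimal sketch of a read loop that an optimizing compiler has to keep. It assumes nothing TriCore-specific: `read_all` is an illustrative name, and the source pointer can be any word array (on the target it would point into LMU SRAM). Because the reads go through a pointer to volatile, each element access must actually be performed.

```c
#include <stddef.h>
#include <stdint.h>

/* Read n words through a pointer to volatile and return the last one.
   Every src[i] is a volatile access, so the compiler may not skip it. */
static uint32_t read_all(const volatile uint32_t *src, size_t n)
{
    uint32_t last = 0u;
    for (size_t i = 0; i < n; ++i) {
        last = src[i];
    }
    return last;
}
```

On the target, a call like read_all(my_global_array, 1000) between starting and stopping CCNT would take the place of the plain assignment loop.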

User19083 · ‎Mar 05, 2020

Hello,

Thanks for the reply and interest.

I explicitly checked DCBYP now and it's 1 (the startup value), so I believe DCACHE is bypassed.

I added a couple of lines to force it anyway: same result. ‘a’ is not volatile, but I'm building with no optimization (-O0) and with debug info (-g).

It looks like the loop has indeed been implemented. The CCNT values are also consistent with 1000 iterations: around 41000 for both tests.

It seems the forum won't let me attach the listing as a file (invalid file), and it's a bit too large to paste here (max 10k characters, it comes to 15k), so I've put it here if you want to take a look: https://bit.ly/32R6vv6

I'm running it on a Triboard with an emulation device for TC297

NeMa_4793301 · ‎Mar 05, 2020

That's generating some very odd code. Here's what I got for the initial rand() loop:

.L21:
	movh.a	a15,#@his(my_global_array)
	lea	a15,[a15]@los(my_global_array)
.L35:
	lea	a12,999
.L2:
	call	rand
.L36:
	st.w	[a15+],d2
	loop	a12,.L2

vs. your assembly listing:


 139:../src/shared_main.c **** 			for(int i=0; i<1000; ++i) {
 179              	.loc 1 140 0
 180 00c0 820F     	mov %d15,0
 181 00c2 59EFF8FF 	st.w [%a14]-8,%d15
 182 00c6 3C14     	j .L11
 183              	.L12:
 184              	.Lcontrol_flow_BBE13:
 185              	.Lcontrol_flow_BBB15:
 140:../src/shared_main.c **** 			            my_global_array[i] = (unsigned int)rand();
 186              	.loc 1 141 0 discriminator 3
 187 00c8 6D000000 	call rand
****  Notice:Expanding call-R -> call-O
 188 00cc 022F     	mov %d15,%d2
 189 00ce 02F2     	mov %d2,%d15
 190 00d0 7B0000F0 	movh %d15,hi:my_global_array
 191 00d4 1B0F0030 	addi %d3,%d15,lo:my_global_array
 192 00d8 19EFF8FF 	ld.w %d15,[%a14]-8
 193 00dc 062F     	sh %d15,2
 194 00de 423F     	add %d15,%d3
 195 00e0 60F2     	mov.a %a2,%d15
 196 00e2 7422     	st.w [%a2]0,%d2
****  Notice:Optimizing st.w-@wd -> st.w-@d
 140:../src/shared_main.c **** 			            my_global_array[i] = (unsigned int)rand();
 197              	.loc 1 140 0 discriminator 3
 198 00e4 19EFF8FF 	ld.w %d15,[%a14]-8
 199 00e8 C21F     	add %d15,1
 200 00ea 59EFF8FF 	st.w [%a14]-8,%d15
 201              	.L11:
 202              	.Lcontrol_flow_BBE15:
 203              	.Lcontrol_flow_BBB14:
 140:../src/shared_main.c **** 			            my_global_array[i] = (unsigned int)rand();
 204              	.loc 1 140 0 is_stmt 0 discriminator 1
 205 00ee 19EFF8FF 	ld.w %d15,[%a14]-8
 206 00f2 3B803E20 	mov %d2,1000
 207 00f6 3F2FE97F 	jlt %d15,%d2,.L12

What compiler + settings are you using? How is my_global_array declared?

User19083 · ‎Mar 06, 2020

Hello,

It's HighTec, --version is:

tricore-gcc (HighTec Release HDP-v4.9.3.0-infineon-1.0-fb21a99) 4.9.4 build on 2019-06-07

Settings for tricore-gcc are:

"C:\HighTec\toolchains\tricore\v4.9.3.0-infineon-1.0/bin/tricore-gcc" -c -gdwarf-2 -I/*..header folders..*/ -fno-common -O0 -g2 -fdwarf-control-flow -W -Wall -Wextra -Wdiv-by-zero -Warray-bounds -Wcast-align -Wignored-qualifiers -Wformat -Wformat-security -Wa,-ahlms=shared_main.lst -pipe -DTC29XB -D__TC29XX__ -D__TRICORE__ -D__TC161__ -DTRIBOARD_TC2X7_V1_0 -D__GNUC__=4 -fshort-double -mcpu=tc29xx -mversion-info -std=gnu99 -MMD -MP -MF"src/shared_main.d" -MT"src/shared_main.o" -o "src/shared_main.o" "../src/shared_main.c"

Omitting the -I / multiple folders.

As for the declaration of my_global_array, as well as all other variables:

uint32_t my_global_array[1000];

It is global, declared outside of all functions in the same .c file for which I shared the .lst (shared_main.c), with no further qualifier. Same for all the other variables involved.

User16286 · ‎Mar 06, 2020

Your core loop is 17 instructions.
Enabling cache saves one or two clocks on the single memory access that is in the loop.
So because your code is so inefficient, the cache only makes the code run about 4% faster.
You need to actually understand what the processor is doing.

Toshi
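
As a rough cross-check of that estimate, here is a small sketch of the arithmetic. It assumes the CCNT reading of about 41000 cycles for 1000 iterations reported earlier in the thread and a saving of one or two cycles per LMU access; the function name is made up for illustration.

```c
#include <stdint.h>

/* Expected gain in hundredths of a percent when each of `accesses` loads
   becomes `saved_per_access` cycles cheaper. Hypothetical helper. */
static uint32_t expected_gain_centipercent(uint32_t total_cycles,
                                           uint32_t accesses,
                                           uint32_t saved_per_access)
{
    if (total_cycles == 0u) {
        return 0u;  /* avoid dividing by zero on an empty measurement */
    }
    uint64_t saved = (uint64_t)accesses * saved_per_access;
    return (uint32_t)((saved * 10000u) / total_cycles);
}
```

For 41000 cycles and 1000 accesses this gives 243 (about 2.4%) for one saved cycle and 487 (about 4.9%) for two, so a working buffer should show up as a difference of 1000 to 2000 cycles in CCNT.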

User19083 · ‎Mar 09, 2020

Hello Toshi,

Thanks for your reply.

The improvement I get is well below 1% (about 0.1%). It doesn't appear to scale with the number of loop iterations either.

That is, running the original code, the improvement I get is 27 cycles (at f_cpu = 300 MHz): the execution time is 33040 cycles the first time (CCNT) and 33013 the second time**.

I'd expect an absolute improvement of some multiple or divisor of 1000, 1000 being the number of LMU accesses.

I presume the current improvement is not related to enabling the LMU buffers but to something else (maybe PCACHE, as I'm running from PFLASH).

**In the original post I referred to 41K cycles for the CCNT reading; I'm now getting 33K cycles. I might have been using different surrounding settings, or maybe the code was slightly different. The poor or non-existent improvement is the same though.
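
To put the measured numbers in per-access terms, here is a minimal sketch. The helper name is made up for illustration; it takes the signed difference of the two CCNT readings and scales it to thousandths of a cycle per access.

```c
#include <stdint.h>

/* Measured gain per access in thousandths of a cycle.
   ccnt_off / ccnt_on: CCNT readings with the buffer disabled / enabled. */
static int32_t gain_millicycles_per_access(uint32_t ccnt_off,
                                           uint32_t ccnt_on,
                                           uint32_t accesses)
{
    if (accesses == 0u) {
        return 0;   /* nothing was measured */
    }
    int64_t delta = (int64_t)ccnt_off - (int64_t)ccnt_on;
    return (int32_t)((delta * 1000) / (int64_t)accesses);
}
```

For 33040 and 33013 cycles over 1000 accesses the result is 27, i.e. 0.027 cycles per access, far from the whole cycles per access a working read buffer would be expected to save.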

Below is code with a more concise loop that shows the same problem: both the first and the second execution of the loop take exactly 13005 cycles (CCNT) in my setup, so there is no change or improvement.

The buffers don't seem to become enabled with that code.



			  for(int i=0; i<2; ++i) {
					unlock_wdtcon();
						  // Setup tag1 of bufcon0 for cpu0
					MODULE_LMU.BUFCON[0].B.TAG1 = 0b000000;
					MODULE_LMU.BUFCON[0].B.EN1 = i%2;
			
						  // Make sure all others are disabled
					MODULE_LMU.BUFCON[0].B.EN2 = 0;
			
					MODULE_LMU.BUFCON[1].B.EN1 = 0;
					MODULE_LMU.BUFCON[1].B.EN2 = 0;
			
					MODULE_LMU.BUFCON[2].B.EN1 = 0;
					MODULE_LMU.BUFCON[2].B.EN2 = 0;
					lock_wdtcon();
			
					// Load loop iterations in a3
					asm volatile (
						"mov	%%d4,1000-1		\n\t"
						"mov.a 	%%a3,%%d4		\n\t"
					::);
			
					uint32_t lmu_addr = 0xb0003000;	// Some LMU addr / non-cached flavor
			
					// Prepare first (LMU) address to load in %%a6
					asm volatile (
						"ld.w	%%d4, %[l_lmu_addr]	\n\t"
						"mov.a 	%%a6, %%d4			\n\t"
					:: [l_lmu_addr] "m" (lmu_addr));
			
					// Prepare shift to add.a in every iter in %%a7, 4 byte
					asm volatile (
						"mov	%%d4, 4	\n\t"
						"mov.a 	%%a7, %%d4			\n\t"
					::);
			
					__mtcr(CPU_CCTRL, 0);   // stop counters
					__mtcr(CPU_CCNT, 0);    // clear CCNT
					_isync();
					__mtcr(CPU_CCTRL, 2);   // start counters
			
					asm volatile (
							".lmu_buff_test_loop:			\n\t"
							"ld.w %%d0, [%%a6]			\n\t"	/* LMU access */
							"add.a %%a6, %%a7			\n\t"	/* update address, +4 */
							"loop %%a3, .lmu_buff_test_loop	\n\t"
					::);
			
					__mtcr(CPU_CCTRL, 0);   // stop counters
			
					ccnt_res[i] = __mfcr(CPU_CCNT);
			  }
			
			  gain_by_readbuf = (int)ccnt_res[0]-ccnt_res[1];

NeMa_4793301 · ‎Mar 24, 2020

Hi jjorba. Could it be that the Safe bit is set (PSW.S=1)? That will change the CPU tag ID. EDIT: I see you tried that earlier, so never mind.

Try setting LMU_BUFCONx.TAG1 and TAG2 to match the TAG assignments of the CPU DMI interfaces, per Table 2-15. There are two tag IDs per BUFCON, so each can support the non-safe ID (PSW.S=0) and the safe ID (PSW.S=1):

All together, like so:

LMU_BUFCON0.U = 0xC0000100;  // EN2=1, EN1=1, TAG2=1 (CPU0 when PSW.S=1), TAG1=0 (PSW.S=0)
LMU_BUFCON1.U = 0xC0000302;  // EN2=1, EN1=1, TAG2=3 (CPU1 when PSW.S=1), TAG1=2 (PSW.S=0)
LMU_BUFCON2.U = 0xC0000504;  // EN2=1, EN1=1, TAG2=5 (CPU2 when PSW.S=1), TAG1=4 (PSW.S=0)
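
The three constants follow one pattern, which can be sketched as below. The field positions are inferred only from the values and comments above (TAG1 in the low bits, TAG2 starting at bit 8, both enables in bits 30 and 31) and have not been checked against the user manual. The constants alone cannot tell which of bits 30 and 31 is EN1, so that assignment is an assumption, as are the macro and function names.

```c
#include <stdint.h>

/* Assumed layout, inferred from 0xC0000100 / 0xC0000302 / 0xC0000504. */
#define BUFCON_TAG1_POS 0u
#define BUFCON_TAG2_POS 8u
#define BUFCON_TAG_MASK UINT32_C(0x3F)        /* 6-bit TAG ID */
#define BUFCON_EN1      (UINT32_C(1) << 30)   /* assumed bit position */
#define BUFCON_EN2      (UINT32_C(1) << 31)   /* assumed bit position */

/* BUFCON value enabling both buffers for one CPU: TAG1 is the non-safe
   DMI TAG ID (2*cpu), TAG2 is the safe DMI TAG ID (2*cpu + 1). */
static uint32_t bufcon_for_cpu(uint32_t cpu)
{
    uint32_t tag_nonsafe = (2u * cpu) & BUFCON_TAG_MASK;       /* PSW.S=0 */
    uint32_t tag_safe    = (2u * cpu + 1u) & BUFCON_TAG_MASK;  /* PSW.S=1 */
    return BUFCON_EN2 | BUFCON_EN1
         | (tag_safe << BUFCON_TAG2_POS)
         | (tag_nonsafe << BUFCON_TAG1_POS);
}
```

bufcon_for_cpu(0), bufcon_for_cpu(1) and bufcon_for_cpu(2) reproduce the three values written above.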

One other idea - can you try segment 9 (cached) instead of segment B (non-cached)?

a = (int)(*(unsigned int *)(0x90000000 + i*4));
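
If it helps when switching between the two views, here is a small sketch of the address conversion. It relies on the TriCore convention that the top four address bits select the segment; the helper name is made up for illustration.

```c
#include <stdint.h>

/* Move an address into another segment by replacing its top four bits. */
static uint32_t to_segment(uint32_t addr, uint32_t segment)
{
    return (addr & UINT32_C(0x0FFFFFFF)) | ((segment & 0xFu) << 28);
}
```

For example, to_segment(0xb0003000, 0x9) gives 0x90003000, the same LMU word seen through the cached segment.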
