What is the architecture difference between TC 1.6 and TC 1.6.1?

User21707 · ‎Apr 25, 2021

I am flashing AUTOSAR-compliant code on a TC297TP and a TC375TE. However, the throughput for an API on the TC297TE is 4 times the throughput calculated on the TC375TP, even though the CPU and STM frequencies are the same on both boards, i.e. 200 MHz and 100 MHz respectively. The compiler used is Tasking 6.2r2.

I read in a few AURIX trainings that the 1.6P and 1.6E architectures have different instruction cycles, i.e. 6 and 4 respectively, but I couldn't find this in any manual. I am aware that the TC297 uses the 1.6 and the TC375 the 1.6.1 architecture, but how do I figure out which of 1.6P/1.6E and 1.6.1P/1.6.1E is being used on the respective boards?

What could be the possible reason for the 4x throughput value with the same code and the same compiler flags, where only the board differs?

Please help !!!!

NeMa_4793301 · ‎Apr 25, 2021

The TC297 has three 1.6P (performance) cores with six-stage pipelines, 300 MHz max clock. All TC3xx micros have 1.6.2P performance cores, and the TC375 maximum speed is 300 MHz. TC1.6E (efficiency) cores with a 4-stage pipeline are present on the TC23x, TC26x (one of two cores), and TC27x (one of three cores).

The TC3xx has a couple new instructions that can provide a very very small performance boost. If you don't enable Software Over The Air (SOTA), the direct local path between CPU0 and PFLASH0 (and CPU1 / PFLASH1, etc.) can be up to a 20% performance boost.

If you're seeing a 4x difference, I suspect that your TC297 application has program cache enabled (CPUx_PCON0.PCBYP=0) and data cache enabled (CPUx_DCON0.DCBYP=0), while your TC375 application does not.

If PCON0 and DCON0 are the same, then it must be a difference in clock speed. Post your PLL and CCUCONx registers and we'll figure it out.

Note also that both TC297 and TC375 can run at 300 MHz.
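
If you are reading PCON0 and DCON0 back in a debugger, a small helper like the one below can turn the raw values into "cache enabled / bypassed". This is only a sketch: the helper name is made up, and it assumes that PCBYP and DCBYP both sit at bit 1 of their registers, which you should confirm against the CPU chapter of your user manual.

```c
#include <stdint.h>
#include <stdbool.h>

/* ASSUMPTION: PCBYP (in PCON0) and DCBYP (in DCON0) are both bit 1.
 * Check the bit position in the user manual for your device. */
#define CACHE_BYPASS_BIT (1u << 1)

/* Hypothetical helper: returns true when the bypass bit is 0,
 * i.e. the cache described by this xCON0 value is enabled. */
bool cache_enabled(uint32_t con0_value)
{
    return (con0_value & CACHE_BYPASS_BIT) == 0u;
}
```

Feed it the PCON0 and DCON0 values of each core on both boards and compare the results side by side.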

User21707 · ‎Apr 25, 2021

UC_wrangler wrote:
The TC297 has three 1.6P (performance) cores with six-stage pipelines, 300 MHz max clock. All TC3xx micros have 1.6.2P performance cores, and the TC375 maximum speed is 300 MHz. TC1.6E (efficiency) cores with a 4-stage pipeline are present on the TC23x, TC26x (one of two cores), and TC27x (one of three cores).

Can you please help me with the information for:
1. Where (document / link / chapter) can I find this information about the pipeline stages of the mentioned architectures (i.e. details such as 1.6P has 6 stages, 1.6E has 4 stages, etc.)?
2. How do I figure out, or where is it mentioned, which kind of core (Performance / Efficiency) is present on a particular board?

UC_wrangler wrote:

The TC3xx has a couple new instructions that can provide a very very small performance boost. If you don't enable Software Over The Air (SOTA), the direct local path between CPU0 and PFLASH0 (and CPU1 / PFLASH1, etc.) can be up to a 20% performance boost.

If you're seeing a 4x difference, I suspect that your TC297 application has program cache enabled (CPUx_PCON0.PCBYP=0) and data cache enabled (CPUx_DCON0.DCBYP=0), while your TC375 application does not.

If PCON0 and DCON0 are the same, then it must be a difference in clock speed. Post your PLL and CCUCONx registers and we'll figure it out.

Note also that both TC297 and TC375 can run at 300 MHz.

I am sharing the PCON,DCON,PLL,CCUCON register values for TC375TE and TC297TF. Please have a look.
PCON0 is "Not Bypassed" in both and DCON0 is "Bypassed" in TC297TF and "Enabled" in TC375TE. Also the PLL is set to 200MHz and STM to 100MHz for both the boards. This I have verified by the register values, ticks, measuring time manually on stopwatch.

NeMa_4793301 · ‎Apr 25, 2021

For TC2xx CPU details, refer to these sections in the TC29x User Manual:

5.3.3 [TC1.6P] Execution Unit
The Execution Unit contains the Integer Pipeline, the Load/Store Pipeline and the Loop Pipeline. All three pipelines operate in parallel, permitting up to three instructions to be executed in one clock cycle. In the execution unit all instructions pass through a decode stage followed by two execute stages. Pipeline hazards (stalls) are minimised by the use of forwarding paths between pipeline stages allowing the results of one instruction to be used by a following instruction as soon as the result becomes available.

5.4.3 [TC1.6E] Execution Unit
The Execution Unit contains the Integer Pipeline and the Load/Store Pipeline. In TC1.6E, loop instructions are always executed by the Load/Store pipeline. The TC1.6E issues a single instruction per clock cycle, and as such no more than one instruction will be executed in one clock cycle.

TC2xx CPU types are listed in the block diagram at the start of the User Manual.

PLLs and clocks look OK at first glance - both CPUs at 200 MHz, STM at 100 MHz.

Can you check the PFLASH wait state settings next? TC2xx: FCON0, and TC3xx: HF_PWAIT.

User21707 · ‎Apr 25, 2021

1. Is there any difference between 1.6P and 1.6.2P, and between 1.6E and 1.6.2E, or is there any comparative study available in this regard? If there are differences, is 1.6.1 described in any document the way pipelining is described in the TC2xx UM (it wasn't mentioned in the TC3xx UM)?

2. Where is it mentioned that all TC3xx cores use TC 1.6.1P? In the TC37xx Expert trainings, both 1.6.1P and 1.6.1E are mentioned in the system architecture, and the TC3xx block diagram doesn't depict the core types the way the TC2xx one does.

UC_wrangler wrote:
Can you check the PFLASH wait state settings next? TC2xx: FCON0, and TC3xx: HF_PWAIT.

Attaching the PFLASH and DFLASH register values for TC375 and TC297. I think there is a difference.

Thank you for the information on TC1.6 pipelining stages !!

**EDIT**
Even after making the PFLASH and DFLASH register values the same in both scenarios (I set the TC297TF to the same values as the TC375TP), the values are still 4x apart. Maybe I am not setting the proper value on the TC297 or the TC375.

NeMa_4793301 · ‎Apr 26, 2021

#1: Pipelines are identical between 1.6P and 1.6.2P. There is no 1.6.2E.

#2: See Table 9 Platform Feature Overview.

For 200 MHz, the wait states should be:
TC375: HF_PWAIT.RFLASH=5, RECC=1, HF_DWAIT.RFLASH=9, RECC=1
TC297: FCON.WSPFLASH=5, WSECPF=1, WSDFLASH=9, WSECDF=1

Right now, your wait states for TC297 are 6+2, vs. TC375 12+5 - that can certainly explain some of the slowdown. The default values of the TC375 are slower, because customers often set the clock to the max value of 300 MHz without remembering to set HF_PWAIT. That leads to ECC errors reading from PFLASH. So, the default value was increased to work at all frequencies.

User21707 · ‎Apr 26, 2021

If the default values for TC375 are slower, then the throughput / timing achieved on the TC375 should have been higher than that of the TC297, but in my scenario the TC375 timing values are lower than those of the TC297.

Secondly, for the calculation of the PWAIT / FCON register values, are the MAX values of these constant parameters the ones to use, or something else?
Data Flash access delay tDF - 100 ns
Data Flash ECC delay tDFECC - 20 ns
Program Flash access delay tPF - 30 ns
Program Flash ECC delay tPFECC - 10 ns
And the frequency used in the register value calculations (Ffsi / Fsri / Ffsi2) is the one we get after setting the desired PLL (in this case 200 MHz), right?

In the TC375 example for PLL settings, no wait state changes were mentioned. So which hidden parameters / procedures need to be taken care of when scaling the clock up? And what method would you suggest for verifying the PLL settings, apart from using a CRO?

Also, DCON0 is different in the two scenarios. Doesn't that impact the performance?

NeMa_4793301 · ‎Apr 26, 2021

If the default values for TC375 are slower, then the throughput / timing achieved on the TC375 should have been higher than that of the TC297, but in my scenario the TC375 timing values are lower than those of the TC297.

Are we talking about time here, or throughput? I take throughput to mean "iterations of code per second", for example - i.e., a frequency, like DMIPS. So higher throughput is better performance.

For both TC2xx and TC3xx, the wait state registers specify a number of CPU clock cycles to wait. You appear to be using 200 MHz FSI2 / SRI / CPU, so the values I listed previously should be correct.
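
As a rough cross-check of those numbers, the arithmetic can be sketched in C. The function names are made up for illustration, and the sketch assumes that each wait state field holds "clock cycles needed to cover the delay, minus one". That rule is not taken from the manual; it simply reproduces the values listed earlier when the PFLASH delays (30 ns, 10 ns) are counted in 200 MHz cycles and the DFLASH delays (100 ns, 20 ns) in 100 MHz cycles. Which clock each field really counts in should be confirmed in the user manual.

```c
#include <stdint.h>

/* Number of whole clock cycles needed to cover delay_ns at freq_mhz,
 * i.e. ceil(delay_ns * freq_mhz / 1000). */
uint32_t delay_cycles(uint32_t delay_ns, uint32_t freq_mhz)
{
    return (delay_ns * freq_mhz + 999u) / 1000u;
}

/* ASSUMPTION: the register field is programmed with (cycles - 1).
 * This matches the values quoted in this thread but is not confirmed
 * by the manual text here. */
uint32_t wait_state_field(uint32_t delay_ns, uint32_t freq_mhz)
{
    uint32_t cycles = delay_cycles(delay_ns, freq_mhz);
    return (cycles > 0u) ? (cycles - 1u) : 0u;
}
```

For example, wait_state_field(30, 200) gives 5 for the PFLASH access delay and wait_state_field(10, 200) gives 1 for the PFLASH ECC delay.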

Yes, DCON0 can make a difference too. But then the placement of variables starts to be important. If a CPU is only accessing variables in its local DSPR, data cache is not used, so DCON0 is irrelevant. If a CPU is accessing remote data (other CPU DSPR, DLMU, or LMU), data cache can have a big impact.

User21707 · ‎Apr 26, 2021

UC_wrangler wrote:
Are we talking about time here, or throughput? I take throughput to mean "iterations of code per second", for example - i.e., a frequency, like DMIPS. So higher throughput is better performance.

Sorry for the confusion. I use the timing of an API (the time between entering and exiting it) and throughput as synonymous terms. So basically I mean: the lower the throughput / time, the better the performance. For simplicity I shall call it timing from now on.
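
To be concrete about what is being measured, this is roughly the calculation, as a minimal sketch. The function names are hypothetical, and it assumes the timestamps are two reads of a free-running 32-bit tick counter (such as the lower STM word) taken at API entry and exit, with at most one counter wrap in between.

```c
#include <stdint.h>

/* Ticks between two reads of a free-running 32-bit counter.
 * Unsigned subtraction gives the right answer across a single wrap. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert ticks to nanoseconds for a timer running at timer_hz. */
uint64_t ticks_to_ns(uint32_t ticks, uint32_t timer_hz)
{
    return ((uint64_t)ticks * 1000000000ull) / timer_hz;
}
```

With the STM at 100 MHz, one tick is 10 ns, so an API that takes 100 ticks between entry and exit runs for 1000 ns.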

So back to the original question: if the values of the wait state registers are lower => the wait time is shorter => API execution is faster. So what would you suggest as the minimum and maximum wait time range for, say, 200 MHz? Clearly DCON0 isn't making an impact in my case, as the code is not accessing another CPU's DSPR etc.

Even with the mentioned register values, the timing values of the APIs on the TC297 are 2x those of the TC375. What else could be the reason?

NeMa_4793301 · ‎Apr 26, 2021

Next, I would look at your map file - where are the variables allocated? If they are in dLMU on the TC3xx (see the Memory Map chapter) and in LMU RAM on the TC2xx, I would not be surprised to see a 2x speed improvement on the TC3xx. The local path between CPU0 and dLMU0 is 0 clocks for a read and 2 clocks for a write on TC3xx, vs. 5 cycles for TC2xx LMU RAM.

Data cache can also have a big impact: the TC3xx has 16K, vs. TC29x 8K. I've seen some benchmarks make an 8x difference in execution time when everything fits in data cache. The results are incredibly application dependent and difficult to predict. You might also try enabling the performance counters and measuring PCACHE and DCACHE misses.

User21707 · ‎Apr 27, 2021

The PCACHE miss count is 35 on the TC297 and 12 on the TC375, and the DCACHE miss count is 33 on the TC297 and 12 on the TC375. But how are these values correlated / what do they indicate?

I have also attached the map file and linker file for each board.

ScottW · ‎Apr 27, 2021

A cache miss is when the CPU tries to access instructions or data that are not already stored in the local cache. Every cache miss represents a potential stall, as the requested data then need to be retrieved from RAM with its attendant wait states. More cache misses will result in slower performance, so the numbers you're seeing are consistent with the higher performance of the TC375.

NeMa_4793301 · ‎Apr 27, 2021

Your map files show that the TC3xx is linking variables to DSPR2, while the TC2xx is linking to LMU RAM. You have about 15K of variables, which fits entirely in the TC3xx 16K data cache, but overflows the TC2xx 8K data cache. That could explain the vast majority of the performance difference.

For best results, use the DSPR associated with that core (e.g., DSPR0 for CPU0, DSPR1 for CPU1, etc.). When a CPU uses its own DSPR, data cache is bypassed. You'll likely see a very small improvement on TC3xx (because the data is less than the size of the data cache), but a huge improvement on TC2xx.

User21707 · ‎Apr 28, 2021

Scott Winder wrote:
A cache miss is when the CPU tries to access instructions or data that are not already stored in the local cache. Every cache miss represents a potential stall, as the requested data then need to be retrieved from RAM with its attendant wait states. More cache misses will result in slower performance, so the numbers you're seeing are consistent with the higher performance of the TC375.

Scott, thank you for the brief explanation of PCACHE and DCACHE!

UC_wrangler wrote:
Your map files show that the TC3xx is linking variables to DSPR2, while the TC2xx is linking to LMU RAM. You have about 15K of variables, which fits entirely in the TC3xx 16K data cache, but overflows the TC2xx 8K data cache. That could explain the vast majority of the performance difference.

For best results, use the DSPR associated with that core (e.g., DSPR0 for CPU0, DSPR1 for CPU1, etc.). When a CPU uses its own DSPR, data cache is bypassed. You'll likely see a very small improvement on TC3xx (because the data is less than the size of the data cache), but a huge improvement on TC2xx.

Wrangler, thank you very much for the detailed information. I am not very confident about memory sections, as I haven't worked with them much. Can you please help me with the following:

1. How did you know there are 15K of variables?
2. On the TC29x, at least 8K of variables should have gone to the data cache. But where are those? [Because the code used is the same, the amount of variables will be the same too]
3. How do I restrict CPU0 to use only DSPR0 as you suggested, and how can I restrict it to use only LMU RAM (just in case)?

Also, if I wish to improve my knowledge of memory sections and interpreting memory maps, where should I start (any tips)?

NeMa_4793301 · ‎Apr 29, 2021

#1: This section of your map file summarizes the RAM usage:

***********************************************************************  Used Resources  ***********************************************************************

* Memory usage in bytes
========================
+------------------------------------------------------------------------+
| Memory          | Code     | Data     | Reserved | Free     | Total    |
|========================================================================|
| mpe:dlmucpu0    |      0x0 |      0x0 |      0x0 | 0x010000 | 0x010000 |
| mpe:dlmucpu1    |      0x0 |      0x0 |      0x0 | 0x010000 | 0x010000 |
| mpe:dlmucpu2    |      0x0 |      0x0 |      0x0 | 0x010000 | 0x010000 |
| mpe:dspr0       |      0x0 | 0x000080 | 0x005400 | 0x036b80 | 0x03c000 |
| mpe:dspr1       |      0x0 | 0x000080 | 0x001000 | 0x03af80 | 0x03c000 |
| mpe:dspr2       |      0x0 | 0x003d82 | 0x001000 | 0x01327e | 0x018000 |

DSPR2 => 0x3D82 => 15746 bytes

#2: Variables do not get allocated to data cache - the cache is dynamic. Each reference to a cacheable 32-byte section that is not present in the data cache will pull in a new entry of 32 bytes. The Least Recently Used algorithm means that if there are more than 8K/32 => 256 entries, the oldest entry ages out. If the oldest entry was written to by the CPU, it is written back to memory (writeback) before it is replaced with a new entry.
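
To make the aging-out concrete, here is a toy model in C that can be run on a PC. It is a sketch under stated assumptions, not a model of the real hardware: it treats the cache as fully associative with strict LRU replacement, ignores writebacks, and all names in it are made up. It sweeps repeatedly over a 15746-byte data set with 32-byte lines and counts the misses.

```c
#include <stdint.h>
#include <stddef.h>

#define LINE_BYTES 32u
#define MAX_LINES  1024u   /* upper bound on modelled cache lines */

typedef struct {
    uint32_t tag[MAX_LINES];      /* which 32-byte block each entry holds */
    uint64_t last_use[MAX_LINES]; /* access counter value at last use */
    size_t   capacity;            /* number of entries in this cache */
    size_t   used;                /* entries filled so far */
    uint64_t clock;               /* increments on every access */
    uint64_t misses;
} lru_cache;

/* One access: a hit refreshes the entry, a miss fills a free entry
 * or replaces the least recently used one. */
static void lru_access(lru_cache *c, uint32_t addr)
{
    uint32_t tag = addr / LINE_BYTES;
    size_t victim = 0;

    c->clock++;
    for (size_t i = 0; i < c->used; i++) {
        if (c->tag[i] == tag) {
            c->last_use[i] = c->clock;
            return;                       /* hit */
        }
        if (c->last_use[i] < c->last_use[victim]) {
            victim = i;                   /* oldest entry seen so far */
        }
    }
    c->misses++;
    if (c->used < c->capacity) {
        victim = c->used++;               /* still a free entry */
    }
    c->tag[victim] = tag;
    c->last_use[victim] = c->clock;
}

/* Read data_bytes word by word, `passes` times, and return the misses.
 * cache_bytes must not exceed MAX_LINES * LINE_BYTES. */
uint64_t sweep_misses(size_t cache_bytes, uint32_t data_bytes, unsigned passes)
{
    lru_cache c;
    c.capacity = cache_bytes / LINE_BYTES;
    c.used = 0;
    c.clock = 0;
    c.misses = 0;

    for (unsigned p = 0; p < passes; p++) {
        for (uint32_t a = 0; a < data_bytes; a += 4u) {
            lru_access(&c, a);
        }
    }
    return c.misses;
}
```

In this model, four passes over 15746 bytes cost 493 misses with a 16K cache (only the first pass misses, once per line), but 1972 misses with an 8K cache, because every line has aged out by the time the sweep comes back to it. Real access patterns are less regular, so treat the numbers as an illustration only.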

#3: Every compiler has its own mysterious methods for linking. In the Tasking toolchain, the LSL file specifies a priority for each section. Unless specified otherwise, variables are allocated to the highest priority number sections first. You could change the priority for DSPR0 to 99:

        memory dspr0 // Data Scratch Pad Ram CPU0
        {
                mau = 8;
                size = 240k;
                type = ram;
                map (dest=bus:tc0:fpi_bus, dest_offset=0xd0000000, size=240k, priority=99, exec_priority=0);
                map (dest=bus:sri, dest_offset=0x70000000, size=240k);
        }

You can also allocate specific sections to specific locations. See 1.12. Compiler Generated Sections in the Tasking ctc_user_guide.pdf. You can also use the very non-ANSI C shortcut __at() in Tasking to force a variable to a certain address:

unsigned char demo_array[4096] __at(0x60001000);

The attached presentation condenses a lot of topics about RAM together. I am sure it will lead to a few new questions :).

User21707 · ‎Apr 29, 2021

UC_wrangler wrote:

#3: Every compiler has its own mysterious methods for linking. In the Tasking toolchain, the LSL file specifies a priority for each section. Unless specified otherwise, variables are allocated to the highest priority number sections first. You could change the priority for DSPR0 to 99:
        memory dspr0 // Data Scratch Pad Ram CPU0
        {
                mau = 8;
                size = 240k;
                type = ram;
                map (dest=bus:tc0:fpi_bus, dest_offset=0xd0000000, size=240k, priority=99, exec_priority=0);
                map (dest=bus:sri, dest_offset=0x70000000, size=240k);
        }
You can also allocate specific sections to specific locations. See 1.12. Compiler Generated Sections in the Tasking ctc_user_guide.pdf. You can also use the very non-ANSI C shortcut __at() in Tasking to force a variable to a certain address:
unsigned char demo_array[4096] __at(0x60001000);
The attached presentation condenses a lot of topics about RAM together. I am sure it will lead to a few new questions :).

1. Wrangler, for the TC297 I remember that the DCACHE was bypassed in the DCON register. Is that the reason the memory wasn't allocated to DCACHE? As I understand it, the memory should have filled up sequentially according to priority.

2. You mentioned priority, and I assume priority = 99 would be the highest priority. But where is the priority table? (Like the interrupt priority table.)

3. But there is no way for me to decide that CPU0 should allocate to DSPR0 and not to DSPR1. Is my understanding right?

Thank you for the inputs on how to understand the memory sections and the compiler. There will surely be more questions about them; I shall put them in a different post. [A post about section types, i.e. bss, zbss etc., for the Tasking toolchain is coming up very soon; looking forward to your help there too.]

NeMa_4793301 · ‎Apr 29, 2021

#1 Correct: with DCON0.DCBYP=1 (the default), DCACHE is bypassed. Every reference to LMU RAM goes over the SRI bus.

#2: Priority in the LSL file has nothing to do with AURIX hardware; it's telling the linker in which order to allocate variables to all of the RAM regions that are defined.

#3: CPU0 doesn't allocate memory, so long as you're not using malloc/free (dynamic allocation). For the application you've got, the memory allocation is decided at link time (i.e., static/fixed allocation). The compiler doesn't know which CPU is going to execute the code - it could be that every CPU executes a given function in parallel, or only a single CPU. It's up to you to craft a linker file that allocates variables in the optimal memories, along with an application sequence of execution that achieves the best performance.

User21707 · ‎Apr 29, 2021

Thank you Wrangler. These explanations have helped me a lot !!
