

FR: How to improve execution speed efficiently

Anonymous
Not applicable

Answer:

  1. Use the 20-bit address mode
  2. Adjust the number of local variables so that a function's stack usage does not exceed 512 bytes
  3. Avoid heavy use of signed 1-byte/2-byte data
  4. Use loop unrolling optimization
  5. Use inline expansion
  6. Use standard library inline expansion
  7. For external access, carefully review the external access frequency and the internal operating frequency
  8. Improve execution speed by understanding the differences between the internal RAM areas
  9. Others


Use the 20-bit address mode

When a program uses many external variables, execution speed can drop at load points because many instructions that load a 32-bit address are generated. As a countermeasure, if the code or data can be placed in RAM/ROM located in the 20-bit address space (0x0 to 0xFFFFF), setting the 20-bit address mode (-K shortaddress option) is recommended.
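As an illustration (a sketch based on the FR instruction set, not actual compiler output; _x stands for some external variable), the address-load instruction becomes shorter when the variable fits in the 20-bit space:

    LDI:32  #_x, R12        ; address anywhere in the 32-bit space (6-byte instruction)
    LDI:20  #_x, R12        ; address within 0x0 to 0xFFFFF, with -K shortaddress (4-byte instruction)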


Adjust the number of local variables so that a function's stack usage does not exceed 512 bytes

The LD/ST instructions can use FP-relative addressing, but because of the 16-bit instruction length, the offset that can be specified is limited to -512 to +508 (for 4-byte types). Therefore, when a function uses a local variable area exceeding 512 bytes, extra operations to calculate the stack address are generated, code size becomes larger, and access efficiency decreases.
By adjusting the number of local variables so that a function's stack usage does not exceed 512 bytes, code size is reduced and access efficiency is improved.

The stack usage of each function can be confirmed with the SOFTUNE C/C++ Analyzer.

(Note) When a local variable is a 2-byte or 1-byte type, the offset that can be specified is -256 to +254 or -128 to +127 respectively, so the size for which efficient code can be generated differs.

                                                                                                                                                                                                     
[C source]   [Offset of -520 (exceeds the range)]   [Offset of -4 (within the range)]
a = 10;      LDI  #10, R0                           LDI  #10, R0
             LDI  #-520, R13                        ST   R0, @(FP, -4)
             ST   R0, @(R13, FP)
             ------------------------------------   --------------------------------
             8 bytes                                4 bytes
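As a sketch of the adjustment (the function and buffer names and sizes below are illustrative, not from the original post), a large local buffer that pushes the frame past 512 bytes can be moved out of the frame, for example into static storage, so that the remaining locals stay within the short FP-relative offset range:

/* Frame larger than 512 bytes: accesses to locals beyond the FP-relative
   offset range need extra address-calculation instructions.              */
void clear_large_frame(void)
{
    char buf[600];                          /* stack usage exceeds 512 bytes */
    unsigned int i;

    for (i = 0; i < sizeof(buf); i++) {
        buf[i] = 0;
    }
}

/* The same work with the buffer in static storage keeps the frame small,
   so every remaining local fits in a short FP-relative offset.            */
static char s_buf[600];

void clear_small_frame(void)
{
    unsigned int i;

    for (i = 0; i < sizeof(s_buf); i++) {
        s_buf[i] = 0;
    }
}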


Avoid heavy use of signed 1-byte/2-byte data

The FR architecture has no load instruction for signed data, so when signed 1-byte/2-byte data is loaded, sign extension is needed after the load. When a lot of signed 1-byte/2-byte data is used, code size increases compared with unsigned data.
By using unsigned types where possible, code size is reduced and access efficiency is improved.

(Note) In the Softune compiler the plain char type is treated as unsigned char, so char can be used as it is.

                                                                                                                                                                                                                                                                                                                                                                                    
[C source]   [signed char type]         [char type (unsigned)]
a = b + c;   LDI:20  #_b, R12           LDI:20  #_b, R12
             LDUB    @R12, R0           LDUB    @R12, R0
             EXTSB   R0                 LDI:20  #_c, R12
             LDI:20  #_c, R12           LDUB    @R12, R1
             LDUB    @R12, R1           ADD     R1, R0
             EXTSB   R1                 LDI:20  #_a, R12
             ADD     R1, R0             STB     R0, @R12
             LDI:20  #_a, R12
             STB     R0, @R12
             -----------------------    -----------------------
             24 bytes                   20 bytes


Use loop unrolling optimization

Loop unrolling optimization improves execution speed by reducing the number of loop iterations, and with it the branch and counter overhead.

Apply loop unrolling where it is needed. An example before and after unrolling is shown below.

                                                                                                                                                           
[Before unrolling]
for (i = 0; i < 6; i++) {
    a[i] = 0;
}

[After unrolling]
for (i = 0; i < 6; i += 3) {
    a[i]   = 0;
    a[i+1] = 0;
    a[i+2] = 0;
}

Note that object size increases when this method is used.


Use inline expansion

Inline expansion replaces a call to a function defined in the C source with the body of that function at the call site. The more it is applied, the more execution speed improves, but object size also increases. When object size has priority, this optimization is not recommended.
(Do not use the -xauto option, the -x option, #pragma inline, or the inline qualifier (C++ only) when object size matters.)

For products with a cache, inline expansion needs to be specified carefully, because the larger code may increase cache misses.
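A minimal sketch (the function names are illustrative, and the exact form of the #pragma inline directive should be checked against the compiler manual):

#pragma inline square                 /* request inline expansion of square() */
static int square(int x)
{
    return x * x;
}

int sum_of_squares(int a, int b)
{
    /* With inline expansion enabled, the body of square() is expanded at
       these call sites instead of generating two function calls.          */
    return square(a) + square(b);
}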


Use standard library inline expansion

Standard library inline expansion recognizes what a standard function does and replaces the call with faster code that performs the same operation, by expanding the standard function inline.
(When object size has priority, suppress this optimization by specifying the standard library inline expansion control option (-Knolib).)
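As a sketch (whether a given call is actually expanded depends on the compiler version and options), a standard function call with a small, constant length is a typical candidate for this replacement:

#include <string.h>

/* With standard library inline expansion enabled (and without -Knolib),
   a call like this one can be replaced by direct load/store instructions
   instead of a call to the memcpy library routine.                        */
void copy_header(char *dst, const char *src)
{
    memcpy(dst, src, 8);
}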


For external access, carefully review the external access frequency and the internal operating frequency

In general, a higher internal operating frequency is expected to give higher instruction-processing capability.
However, when the external bus is accessed, a higher internal operating frequency does not always improve instruction-processing capability, because performance depends more on the access time of the device connected to the bus (for example, the Flash memory that stores the instructions in external-ROM products).

The following shows an example of benchmark results for the MB91151A.
As the analysis for each operating frequency below shows, between internal 33 MHz / external 16.5 MHz and internal 25 MHz / external 25 MHz, the latter gives the higher performance. (When the cache is used and the cache hit rate is high, the product with the higher internal frequency is of course faster because of its higher processing speed, so that case is not covered by the statement above.)

Example: relation between instruction-processing speed and internal/external operating frequency for the MB91151

Test program outline

                                                            
test1   structure / branch / substitution
test2   initialization of a one-dimensional array / branch / substitution
test3   initialization of a two-dimensional array / branch / substitution

                                                                                                                                                                                                                                                                                                                                                                                                                                               
Cache OFF (70 ns FLASH operation / 60 ns FLASH operation), speed comparison (unit: usec)

Configuration | STACK | DATA | CODE | test1 | test2 | test3 | total
MB91151 / Size priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 580.2 | 19.5 | 666.4 | 1266.1
MB91151 / Speed priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 557.3 | 18.9 | 659.3 | 1235.5
MB91151 / Size priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 0 wait) | F-RAM | F-RAM | Ex-FLASH | 386.9 | 13.0 | 445.3 | 845.2
MB91151 / Speed priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 0 wait) | F-RAM | F-RAM | Ex-FLASH | 371.6 | 12.6 | 446.8 | 831.0
MB91151 / Size priority (CPU 25 MHz, Ex-FLASH 25 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 386.9 | 13.0 | 443.1 | 843.0
MB91151 / Speed priority (CPU 25 MHz, Ex-FLASH 25 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 371.6 | 12.6 | 444.6 | 828.8

                                                                                                                                                                                                                                                                                                                         
Cache OFF (70 ns FLASH operation / 60 ns FLASH operation), MIPS (Dhrystone 1.1)

Configuration | STACK | DATA | CODE | MIPS
MB91151 / Size priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 3.63
MB91151 / Speed priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 3.05
MB91151 / Size priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 0 wait) | F-RAM | F-RAM | Ex-FLASH | 5.25
MB91151 / Speed priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 0 wait) | F-RAM | F-RAM | Ex-FLASH | 4.57
MB91151 / Size priority (CPU 25 MHz, Ex-FLASH 25 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 7.00
MB91151 / Speed priority (CPU 25 MHz, Ex-FLASH 25 MHz, 1 wait) | F-RAM | F-RAM | In-FLASH | 6.09

                                                                                                                                                                                                                                                                                                                                                                                                                                               
Cache ON (70 ns FLASH operation / 60 ns FLASH operation), speed comparison (unit: usec)

Configuration | STACK | DATA | CODE | test1 | test2 | test3 | total
MB91151 / Size priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 474.4 | 2.5 | 107.9 | 584.8
MB91151 / Speed priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 488.7 | 2.3 | 107.4 | 598.4
MB91151 / Size priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 0 wait) | F-RAM | F-RAM | Ex-FLASH | 318.6 | 2.3 | 102.9 | 423.8
MB91151 / Speed priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 0 wait) | F-RAM | F-RAM | Ex-FLASH | 326.9 | 2.2 | 102.3 | 431.4
MB91151 / Size priority (CPU 25 MHz, Ex-FLASH 25 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 321.2 | 3.0 | 125.8 | 450.0
MB91151 / Speed priority (CPU 25 MHz, Ex-FLASH 25 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 328.5 | 2.8 | 124.4 | 455.6

                                                                                                                                                                                                                                                                                                                         
Cache ON (70 ns FLASH operation / 60 ns FLASH operation), MIPS (Dhrystone 1.1)

Configuration | STACK | DATA | CODE | MIPS
MB91151 / Size priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 30.35
MB91151 / Speed priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 25.95
MB91151 / Size priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 0 wait) | F-RAM | F-RAM | Ex-FLASH | 30.35
MB91151 / Speed priority (CPU 33 MHz, Ex-FLASH 16.5 MHz, 0 wait) | F-RAM | F-RAM | Ex-FLASH | 25.95
MB91151 / Size priority (CPU 25 MHz, Ex-FLASH 25 MHz, 1 wait) | F-RAM | F-RAM | Ex-FLASH | 30.35
MB91151 / Speed priority (CPU 25 MHz, Ex-FLASH 25 MHz, 1 wait) | F-RAM | F-RAM | In-FLASH | 25.95


Improve execution speed by understanding the differences between the internal RAM areas

To achieve higher execution speed, it is effective to locate each area on-chip where possible.
In the FR series, each RAM is connected to a particular bus, so the differences between the buses must be understood in order to understand the differences between the RAMs. The bus structure of the FR series is as follows.
The FR core has independent instruction and data buses (Harvard architecture): the I-bus carries instructions and the D-bus carries data. The F-bus is a Princeton (von Neumann) bus that can carry both instructions and data; it is connected to the I-bus and D-bus via a bus converter. The RAM on each bus is called I-RAM, D-RAM, and F-RAM respectively.

  • I-RAM allows 1-cycle instruction execution when instructions (the CODE area) are located there.
  • D-RAM allows 1-cycle access when data (the DATA/STACK areas, etc.) is located there. The CODE area cannot be located in D-RAM.
  • F-RAM can hold the CODE, DATA, STACK, and other areas,
      but it takes more cycles than I-RAM/D-RAM because accesses pass through the bus converter.

When locating sections with the linker, execution speed is improved by taking the RAM areas above into account.

[Figure: understanding_the_difference_of_internal_RAM.gif (FR internal bus structure and the corresponding I-RAM/D-RAM/F-RAM)]


Others

Pay attention to memory copies

Code such as structure assignment, structure arguments, structure return values, and strXXX/memXXX calls can affect execution speed. For such code, review whether the compiler output has any problem, if possible. In particular, the strXXX/memXXX library functions are written for the general case, so if the alignment and length of the data being handled are already known, a dedicated routine is faster than calling the library. Copying in word units rather than byte units is of course faster.
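A minimal sketch of the word-unit idea (the function name and the alignment/length assumptions are illustrative): when both buffers are word-aligned and the length is a multiple of 4, copying in 32-bit units needs a quarter of the iterations of a byte-by-byte copy:

void copy_words(unsigned long *dst, const unsigned long *src, unsigned int nbytes)
{
    unsigned int i;

    /* 4 bytes per iteration instead of 1 */
    for (i = 0; i < nbytes / 4; i++) {
        dst[i] = src[i];
    }
}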

Avoid run-time library calls

Operations that cannot be performed with actual instructions are handled by run-time library calls, so avoid the data types that cause them (long type and others) as far as possible.

Locate structure members that are referenced frequently at the head of the structure

Access to a structure member resolves the actual address by adding an offset to the structure's head address.
The first member needs no such calculation because its offset is 0. When a member has a high static access frequency, review whether it can be placed at the head of the structure.
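A sketch with illustrative member names: the member referenced most often is placed first so that it is accessed at offset 0:

struct packet {
    int  status;            /* referenced most often: offset 0, no offset calculation */
    char payload[32];
    int  checksum;
};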

Make functions that return a structure void

A function that returns a structure causes a structure transfer through a work area. By passing the address of the destination structure as an argument and assigning to it directly, the function can be made void and the extra copy avoided.
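A sketch with illustrative types and names:

struct point { int x; int y; };

/* Returns a structure: the result is transferred through a work area. */
struct point make_point(int x, int y)
{
    struct point p;
    p.x = x;
    p.y = y;
    return p;
}

/* void version: the caller passes the destination address and the function
   assigns to it directly, so the extra structure copy is avoided.          */
void make_point_to(struct point *dst, int x, int y)
{
    dst->x = x;
    dst->y = y;
}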

Keep the number of arguments within four

With four or fewer arguments, no stack-access code is needed because the arguments are passed in registers, so execution speed is improved. If a function has arguments that serve no purpose, review whether they can be removed.
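A sketch (function names are illustrative):

/* Four arguments or fewer: all of them are passed in registers, so the
   call generates no stack-access code.                                   */
int sum4(int a, int b, int c, int d)
{
    return a + b + c + d;
}

/* A fifth argument no longer fits in the argument registers and is
   passed via the stack.                                                  */
int sum5(int a, int b, int c, int d, int e)
{
    return a + b + c + d + e;
}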

