PSOC 4 32 Bit Multiplier

ETRO_SSN583

What are the register names and programming model, with an example, for using the onboard 32-bit multiplier in the M0?

I looked in the Architecture and Register TRMs, no luck.

I want to convert a software-math-based FFT to use this feature. Note I believe the CMSIS library does not use this HW feature, which cripples the FFT performance that using it would offer.

 

Regards, Dana.

 

 


0 Likes
13 Replies
PandaS
Moderator

Hi @ETRO_SSN583 ,

I am trying to understand your query. Is the input array a floating-point array? If so, could you try marking the values with the right suffix (1f, 2f, ...), which should force the operation to be done in hardware using the HARDFP identifier in the compiler? You could then compare the assembly-level instructions in the disassembler.

Thanks and regards,

Sobhit

0 Likes
ETRO_SSN583

I created a new PSoC 4 project and, searching the entire workspace, I cannot find any HARDFP assignment or identifier in the basic project configuration.

I want to take two variables, load the appropriate registers, invoke the single-cycle multiplier in the PSoC 4, and access the result.....

 

Regards, Dana.

0 Likes
BiBi_1928986

Hello.

Did you add the math package to the Build Settings? See KBA93076, "Using Math Functions in PSoC® Creator™ for the PSoC 4 or PSoC 5LP GCC Compiler" (infineon.com).

0 Likes

Yes, still no HARDFP after performing basic build.

 

Regards, Dana.

0 Likes

Dana, I believe that on the Cortex-M0 a standard multiplication (int x int) takes a single cycle when the output is also int:

int32 a, b;
int32 r = (int32) a * b;

And it takes 4 cycles to get an int64 result:

int32 a, b;
int64 r = (int64) a * b;

There are no SIMD intrinsic instructions on the M0 processor (M4 and up only).
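The two cases above can be written out as a minimal C sketch. The function names `mul32` and `mul64` are made up for illustration, and standard `<stdint.h>` types stand in for the PSoC Creator `int32`/`int64` typedefs. On an ARMv6-M core the first form should compile to a single MULS instruction; there is no long-multiply instruction in that instruction set, so the second form has to be built by the compiler from several 32-bit operations. The cycle counts themselves depend on the core configuration and are not asserted here.

```c
#include <stdint.h>

/* 32 x 32 -> 32: only the low 32 bits of the product are kept. */
static int32_t mul32(int32_t a, int32_t b)
{
    /* Multiply as unsigned so that overflow wraps instead of being
       undefined behaviour; the cast back assumes two's complement. */
    return (int32_t)((uint32_t)a * (uint32_t)b);
}

/* 32 x 32 -> 64: widen one operand first so the full product survives. */
static int64_t mul64(int32_t a, int32_t b)
{
    return (int64_t)a * b;
}
```

Comparing the two in the .lst file is a quick way to see which form the compiler actually emitted for a given expression.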

 

Do you need full spectrum FFT or Goertzel might suffice?

0 Likes
ETRO_SSN583

Running a 4M board at a 48 MHz clock, a loop doing 1000 32-bit multiplies took 560 uS, or 560 ns per multiply. I also tried it with nano lib on and off, no change.

So still not getting to single-cycle type speed.

I will look at the ASM shortly for clues.....

I have not looked at Goertzel, it is new to me, will investigate. But I still want access to that multiplier. It's a strong feature that I want to tell others about, and a standard FFT would be a good example.

 

Regards, Dana.

0 Likes

Dana,

please make sure that the compiler is in Release mode, with the optimization level set for Speed AND link-time optimization enabled.

Build_settings.PNG

PS. I am working on the Goertzel filter. Calling a procedure of type:

 

int32 Q0,Q1,Q2,coef;
... 
void Filter_Write(int16 value) {
   Q0 = (int32) Q1 * coef - Q2 + value;
   Q2=Q1;
   Q1=Q0;
}

 

This consistently takes 12 clocks, including the procedure call overhead. Using the 64-bit version of it, (int64) Q1 * coef..., takes 16 clocks, which is consistent with 64-bit emulation of the multiplication using four 32-bit instructions.
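For readers new to Goertzel, here is a minimal fixed-point sketch built around the same recurrence. It is not the poster's actual filter: the Q14 scaling of the coefficient, the struct, and the function names are assumptions added for illustration. The coefficient is taken to be 2*cos(2*pi*k/N) scaled by 2^14, the 32-bit product is assumed not to overflow (small samples, short blocks), and the right shift of a negative value is assumed to be arithmetic, as it is with GCC.

```c
#include <stdint.h>

#define GOERTZEL_Q 14   /* assumed format: coef = 2*cos(2*pi*k/N) * 2^14 */

typedef struct {
    int32_t q1;        /* state delayed by one sample  */
    int32_t q2;        /* state delayed by two samples */
    int32_t coef_q14;  /* precomputed coefficient      */
} goertzel_t;

static void goertzel_init(goertzel_t *g, int32_t coef_q14)
{
    g->q1 = 0;
    g->q2 = 0;
    g->coef_q14 = coef_q14;
}

/* One step per sample: a single 32 x 32 -> 32 multiply plus a shift. */
static void goertzel_write(goertzel_t *g, int16_t value)
{
    int32_t q0 = ((g->q1 * g->coef_q14) >> GOERTZEL_Q) - g->q2 + value;
    g->q2 = g->q1;
    g->q1 = q0;
}

/* Squared magnitude of the target bin, evaluated once per block. */
static int64_t goertzel_power(const goertzel_t *g)
{
    int64_t cross = ((int64_t)g->q1 * g->coef_q14) >> GOERTZEL_Q;
    return (int64_t)g->q1 * g->q1 + (int64_t)g->q2 * g->q2 - cross * g->q2;
}
```

Feeding 64 samples of the pattern 100, 0, -100, 0 into a filter with coefficient 0 (bin N/4) gives a power of 10240000, which is (100 * 64 / 2)^2. Only the per-sample step needs to be fast; the 64-bit arithmetic in `goertzel_power` runs once per block.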

 

0 Likes
ETRO_SSN583

Thanks, I am now getting 125 uS for 1000, or 125 ns per multiply. But this is running inside a for loop, so those clocks add to the time. Much better, but I am thinking it did not use the HW multiplier.....

Note when I do a single-multiply measurement, timed from the falling edge of the init pulse to the rising edge of the finish pulse, I get 0.51 uS. Still far away from a single clock cycle of ~20 nS. Odd though, 4X slower..... Then looking at the .lst file I do not see muls anymore, gah!

 

 

 

    while ( 1 ) {
        
        Pin_1_Write(1);         // Trigger for start 1000 multiplies
        Pin_1_Write(0);
        Pin_1_Write(1);
        Pin_1_Write(0);

        
        for(iCntr = 0; iCntr < 1000; ++iCntr) {

            /* Place your application code here. */
           
            zz = x * y;
    
        }
        Pin_1_Write(1);         // Trigger end of 1000 multiplies
        Pin_1_Write(0);
        zzsav = zz;
        
        CyDelay(1);
    }

 

 

 

 

Note the .lst file uses muls. Is that not the instruction that invokes the HW multiplier?

 

 

 

  38:main.c        ****         
  39:main.c        ****         for(iCntr = 0; iCntr < 1000; ++iCntr) {
  78              		.loc 1 39 0
  79 003c 1223     		movs	r3, #18
  80 003e FB18     		adds	r3, r7, r3
  81 0040 0022     		movs	r2, #0
  82 0042 1A80     		strh	r2, [r3]
  83 0044 0AE0     		b	.L2
  84              	.L3:
  40:main.c        **** 
  41:main.c        ****             /* Place your application code here. */
  42:main.c        ****            
  43:main.c        ****             zz = x * y;
  85              		.loc 1 43 0 discriminator 3
  86 0046 FB68     		ldr	r3, [r7, #12]
  87 0048 BA68     		ldr	r2, [r7, #8]
  88 004a 5343     		muls	r3, r2
  89 004c 7B61     		str	r3, [r7, #20]
  39:main.c        **** 
  90              		.loc 1 39 0 discriminator 3
  91 004e 1223     		movs	r3, #18
  92 0050 FB18     		adds	r3, r7, r3
  93 0052 1222     		movs	r2, #18
  94 0054 BA18     		adds	r2, r7, r2
  95 0056 1288     		ldrh	r2, [r2]
  96 0058 0132     		adds	r2, r2, #1
  97 005a 1A80     		strh	r2, [r3]
  98              	.L2:
  39:main.c        **** 
  99              		.loc 1 39 0 is_stmt 0 discriminator 1
 100 005c 1223     		movs	r3, #18
 101 005e FB18     		adds	r3, r7, r3
 102 0060 1B88     		ldrh	r3, [r3]
 103 0062 074A     		ldr	r2, .L5
 104 0064 9342     		cmp	r3, r2
 105 0066 EED9     		bls	.L3
  44:main.c        ****         }

 

 

 

 

 

Regards, Dana

 

 

0 Likes
ETRO_SSN583

Here I see that NXP enables the HW multiplier in their core via the compiler options configuration:

https://community.nxp.com/t5/LPCXpresso-IDE-FAQs/Use-of-Cortex-M0-M0-multiply-instructions-on-LPC43x...

So how do we do the same?

 

 

Regards, Dana.

0 Likes

I see these tables of instruction timings for Cortex-M3 and Cortex-M0:

https://os.mbed.com/media/uploads/4180_1/cortexm3_instructions.htm

https://s-o-c.org/cortex-m0-multiply-cycles/

 

I would say that 120 ns ~ 6 cycles is pretty good, assuming the loop increment eats 2 cycles.

However, the results can be misleading, as the compiler tends to optimize out (skip) any calculations whose results are not used. So for test purposes the operands in the loop should change on each iteration, and the result should be used (e.g. printed out).
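One way to follow that advice is sketched below. The function name `multiply_benchmark` is made up for illustration and the constants are just a commonly used linear congruential generator pair; neither comes from PSoC Creator. Each pass feeds the previous product into the next multiply, so the compiler cannot hoist or drop the multiply, and the volatile read keeps it from evaluating the whole loop at build time.

```c
#include <stdint.h>

/* Runs a chain of dependent 32-bit multiplies and returns the final value. */
static uint32_t multiply_benchmark(uint32_t seed, uint32_t iterations)
{
    volatile uint32_t input = seed;  /* volatile read blocks constant folding */
    uint32_t x = input;

    for (uint32_t i = 0; i < iterations; ++i) {
        /* One multiply and one add per pass; x changes every iteration. */
        x = x * 1664525u + 1013904223u;
    }
    return x;  /* the caller must use this, e.g. print or store it */
}
```

On the target, the call would sit between the two pin toggles and the returned value would be written to a global or sent over the UART, so the timed work cannot be discarded. Note that the measured time then includes the add and the loop overhead as well as the multiply.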

0 Likes

I am getting mixed results.

So I have compiler optimization turned off, and execute the following code once. I get 2.76 uS for the 10 multiplies, or 276 nS per multiply.

At 48 MHz a CPU clock is ~20 nS, and the LDR and STR are 2 cycles each, so the total is 6 for load/store + 1 for muls = 7 cycles, or 140 nS.
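The cycle arithmetic here can be checked with a small helper. This is only a sketch: the function names are invented for illustration and both conversions round to the nearest integer.

```c
#include <stdint.h>

/* Convert a CPU cycle count to nanoseconds at the given clock frequency. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000u + cpu_hz / 2u) / cpu_hz);
}

/* Convert a measured time in nanoseconds back to CPU cycles. */
static uint32_t ns_to_cycles(uint32_t ns, uint32_t cpu_hz)
{
    return (uint32_t)(((uint64_t)ns * cpu_hz + 500000000u) / 1000000000u);
}
```

At 48 MHz, 7 cycles come out at 146 ns, and the measured 276 ns corresponds to about 13 cycles per multiply statement, so roughly 6 cycles per statement are unaccounted for by the load/multiply/store sequence alone.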

 

Measured from the falling edge of the second pulse at the start to the rising edge of the pulse at the end, on a 200 MHz scope.

So the short answer is I should be satisfied, but there is a 2X discrepancy......

 

 

        Pin_1_Write(1);         // Trigger for start 10 multiplies
        Pin_1_Write(0);
        Pin_1_Write(1);
        Pin_1_Write(0);
           
        zz = x * y;
        zz = x * y;
        zz = x * y;
        zz = x * y;
        zz = x * y;
        zz = x * y;
        zz = x * y;
        zz = x * y;
        zz = x * y;
        zz = x * y;  
        
        Pin_1_Write(1);         // Trigger end of 10 multiplies
        Pin_1_Write(0);

 

 

 

 

 223 0044 7B69     		ldr	r3, [r7, #20]
 224 0046 3A69     		ldr	r2, [r7, #16]
 225 0048 5343     		muls	r3, r2
 226 004a FB60     		str	r3, [r7, #12]

 

 

Regards, Dana.

0 Likes

Hello.

The 7 cycles for the multiply is correct as you analyzed it. But there are also an additional 20 cycles for Pin_1_Write(1). Since you only need to trigger on the rising edge, the port bit changes state a few cycles before the full 20 cycles have elapsed. See AN86439 for the disassembly of Pin_1_Write().

If you have access to a 4-channel scope or logic analyzer, I found it useful to not only trigger on the GPIO, but to also bring HFCLK/2 out to a GPIO. That way you can visually count the cycles and you'll only need to perform 1 multiply between GPIO triggers.

0 Likes

To improve on Pin_1_Write(), you can use the macros:
CY_SYS_PINS_SET_PIN(portDr, pin) and CY_SYS_PINS_CLEAR_PIN(portDr, pin)
These only take 8 cycles each, which makes the begin/end-of-multiply edges a little more precise to measure.

Another approach is to repeat the multiply about 100 times (not in a for loop) to reduce the effect of the GPIO writes. You've already done something similar with 10 repetitions. Now just copy/paste that 10 more times.

I would inline the assembler code as a macro. Otherwise the ARM spends too much time retrieving variables off the stack.

0 Likes