USB superspeed peripherals Forum Discussions

mialc_1106291
Level 4
First solution authored First like given 25 sign-ins

We've created a custom board containing a CYUSB3014-BZXC with a GPIF interface attached to a Spartan 7 FPGA. My GPIF state machine is very similar to the Slave FIFO design described in AN65974, with two primary differences:
1. I've removed the SLCS signal. We use a GPIO pin to tell the FPGA when the host application has enabled the interface. The FX3 still provides FLAGA and FLAGB and they are connected to Current_Thread_DMA_Ready and Current_Thread_DMA_Watermark. The FPGA controls the remaining signals (SLRD#, SLOE#, SLWR#, PKTEND#, and A[1:0]).
2. The FX3 sources PCLK but it is still a slave design. This was done so that a separate GPIF design can be used to configure the FPGA in Slave SelectMap mode.

When the host initiates a data transfer that involves OUT data, the FX3 firmware creates a multichannel DMA between USB OUT endpoint 3 and GPIF threads 2 and 3 and allocates two 16K buffers. If the transfer involves IN data, the firmware creates a multichannel DMA between USB IN endpoint 4 and GPIF threads 0 and 1 and allocates two 16K buffers. If the transfer is bidirectional, both DMA channels are created, with the same buffer sizes as in the unidirectional case.

When the host wants to transfer data, it first sends a command telling the FX3 to enable the GPIF interface. The firmware then configures the PIB clock, loads the state machine, and drives the enable signal to the FPGA to '1'.

The host test application can be configured to perform OUT-only transfers, IN-only transfers, or bidirectional transfers with data verification, in a loop. Each time the host enters this loop, it sends a 16-byte command to the FPGA that specifies the transfer direction and whether data is written to/read from a 16K BRAM FIFO inside the FPGA. If the BRAM FIFO is not used, data read by the FPGA is simply discarded, and data written by the FPGA comes from a counter instead of the FIFO. The command also includes OUT and IN byte counts so that the FPGA knows how much data to transfer in each direction.
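To make the command concrete, here is a small packing helper. The thread does not give the real field layout, so the layout below (flags in byte 0, OUT byte count in bytes 4-7, IN byte count in bytes 8-11, little-endian, the rest reserved) is purely hypothetical, as are the names `build_cmd`, `put_le32`, and the `CMD_*` flag bits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits for byte 0 of the command (not from the thread). */
#define CMD_DIR_OUT  0x01u  /* host -> FPGA data is expected                */
#define CMD_DIR_IN   0x02u  /* FPGA -> host data is expected                */
#define CMD_USE_BRAM 0x04u  /* route data through the 16K BRAM FIFO;
                               otherwise OUT data is discarded and IN data
                               comes from a counter                          */

#define CMD_LEN 16

/* Store a 32-bit value little-endian at p[0..3]. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)((v >> 8) & 0xFFu);
    p[2] = (uint8_t)((v >> 16) & 0xFFu);
    p[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Fill a 16-byte command: byte 0 = flags, bytes 4-7 = OUT byte count,
   bytes 8-11 = IN byte count, all other bytes reserved (zero). */
static void build_cmd(uint8_t cmd[CMD_LEN], uint8_t flags,
                      uint32_t out_bytes, uint32_t in_bytes)
{
    memset(cmd, 0, CMD_LEN);
    cmd[0] = flags;
    put_le32(&cmd[4], out_bytes);
    put_le32(&cmd[8], in_bytes);
}
```

A fixed-size, explicitly serialized command keeps the FPGA parser simple and avoids depending on host struct padding or endianness.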

After the command is downloaded, the bulk data transfer begins. For OUT-only transfers, the FPGA reads the data over the Slave FIFO interface and discards it. For IN-only transfers, the FPGA writes the value of an internal counter to the FX3. For bidirectional transfers, the FPGA first reads a buffer of data from the FX3 into the BRAM FIFO (address 2 or 3, alternating on each FX3 read and always starting on address 2) and then writes the BRAM FIFO contents to the FX3 (address 0 or 1, alternating on each FX3 write and always starting on address 0). The process repeats, with the addresses toggling on each transfer, until all data has been transferred.
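The bidirectional address pattern can be written as a small generator: each OUT read (threads 2/3, starting at 2) is followed by an IN write (threads 0/1, starting at 0), and each direction toggles its own address. This is a behavioural sketch of the sequence described above, not the actual FPGA logic; the function name and array interface are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill seq[] with the FIFOADDR values used for n_pairs read/write pairs:
   read from the OUT thread (2,3,2,3,...), then write to the IN thread
   (0,1,0,1,...). seq must have room for 2 * n_pairs entries. */
static void bidir_fifoaddr_seq(uint8_t *seq, size_t n_pairs)
{
    uint8_t out_addr = 2;  /* OUT sockets: 2 and 3, always start on 2 */
    uint8_t in_addr  = 0;  /* IN sockets:  0 and 1, always start on 0 */

    for (size_t i = 0; i < n_pairs; i++) {
        seq[2 * i]     = out_addr;  /* FPGA reads one buffer into BRAM   */
        seq[2 * i + 1] = in_addr;   /* FPGA writes BRAM back to the FX3  */
        out_addr ^= 1u;             /* 2 <-> 3 */
        in_addr  ^= 1u;             /* 0 <-> 1 */
    }
}
```

For three pairs this produces 2, 0, 3, 1, 2, 0.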

Once the bulk data transfer completes, several smaller transfers are performed to read statistics registers inside the FPGA. These statistics include how many times the FPGA thinks it asserted SLOE, SLRD, SLWR, and PKTEND. They also include the last 32 OUT addresses and the last 32 IN addresses and, if the transaction is aborted by the host due to a timeout, the address that the FPGA thinks it will read or write next and the count of words presently in the BRAM FIFO.

If the test is supposed to be performed for more than one iteration then it loops.

Testing has shown that unidirectional OUT transfers and unidirectional IN transfers work reliably at 100 MHz. Bidirectional transfers work reliably at 89.6 MHz but fail randomly when PCLK is set to 100 MHz (actually 100.8 MHz).

When the bidirectional transfers fail, the statistics collected from the FPGA show that the FPGA last received a buffer from the FX3, wrote it into the BRAM FIFO, and is now waiting for FLAGA to indicate that an IN buffer is available to write to. The count of words in the BRAM FIFO is 4096, and the last 32 OUT and IN addresses show no case where the same address was read or written twice in a row, meaning that the FPGA correctly toggles the address each time. I can also see that the FPGA logic is waiting for an IN buffer on the correct address. However, it never receives one.
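The "no back-to-back repeats" check on the captured address history is easy to automate on the host. A minimal sketch, assuming the 32 most recent addresses are delivered oldest first in a plain byte array (the actual statistics register format is not given in the thread; `addresses_alternate` is a hypothetical name):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if consecutive entries in hist[] always differ, i.e. the FPGA
   toggled FIFOADDR between every pair of accesses in the captured window. */
static bool addresses_alternate(const uint8_t *hist, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (hist[i] == hist[i - 1])
            return false;  /* same socket used twice in a row */
    }
    return true;
}
```

Running this on both the OUT and IN histories after every failed iteration turns the manual inspection into a pass/fail line in the test log.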

In the FX3 firmware I've set up a DMA callback for producer events (the consumer event was never signaled) on both the OUT and IN DMA channels. Each time the producer event occurs, I increment the count of bytes transferred by 16K, never exceeding the total size of the transfer, which may not be a multiple of 16K. The host application receives these counts after the transaction completes or is aborted (fails). From them I can see that the FX3 thinks it received one fewer 16K buffer than the FPGA thinks it sent. This makes me think that the signal timing may be off and that a SLWR is being missed by the FX3. Is there a way for me to determine how many bytes are stored in the DMA buffers associated with sockets 0 and 1 so that I can see whether a word was missed?
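The per-event accounting described above (add 16K per producer event, but never report more than the transfer size) is a saturating add. A minimal sketch; `BUF_SIZE` and `count_prod_event` are illustrative names:

```c
#include <stdint.h>

#define BUF_SIZE 16384u  /* DMA buffer size used in the test */

/* Called once per producer event: adds one buffer's worth of bytes but
   never lets the running total exceed the size of the whole transfer.
   Assumes counted <= total on entry. */
static uint32_t count_prod_event(uint32_t counted, uint32_t total)
{
    uint32_t remaining = total - counted;
    return counted + (remaining < BUF_SIZE ? remaining : BUF_SIZE);
}
```

Because this assumes every committed buffer is full, it counts buffers rather than bytes and cannot reveal a buffer that was committed short.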

Some other observations:
Our firmware outputs the USB PHY and link error counts over UART once per second. During an OUT-only transfer the error count stays at 0, even over a large number of transfers. The same is true for IN-only transfers. However, when I perform bidirectional transfers, I see PHY errors reported every second. Why are there PHY errors during bidirectional transfers but not during unidirectional transfers?

1000 ms PHY errors = 12
1000 ms LINK errors = 0
1000 ms PHY errors = 6
1000 ms LINK errors = 0
1000 ms PHY errors = 11
1000 ms LINK errors = 0
1000 ms PHY errors = 12
1000 ms LINK errors = 0
1000 ms PHY errors = 6
1000 ms LINK errors = 0
1000 ms PHY errors = 12
1000 ms LINK errors = 0
1000 ms PHY errors = 12
1000 ms LINK errors = 0
1000 ms PHY errors = 6
1000 ms LINK errors = 0
1000 ms PHY errors = 10
1000 ms LINK errors = 0
1000 ms PHY errors = 11
1000 ms LINK errors = 0
1000 ms PHY errors = 12

I've registered a callback function to see if CYU3P_USBEP_SS_RESET_EVT occurs, but it does not.

Larger (160 MB) bidirectional transfers fail much more frequently than smaller (16 MB) ones.

Our firmware sets the OUT and IN endpoint burst size to 16 by default. If I change it to 1 (burst disabled) on both endpoints, bidirectional transfers NEVER fail, even at 100 MHz. I still see the same number of PHY errors as I did with bursts enabled at 100 MHz, so I'm not sure that the PHY errors are what ultimately causes IN transfers to stop.

Are the PHY errors what ultimately causes the IN transfers to stop, or is it something else?


Any suggestions on how to debug this problem further? I tried to capture the bus traffic with an Ellisys EX350, but that's very difficult: the analyzer seems to prevent the transfer from failing (or at least makes it fail much less often), and the amount of data is so large that the capture stops.

Thanks,
Michael

17 Replies
Rashi_Vatsa
Moderator
Moderator 500 solutions authored 1000 replies posted 750 replies posted

Hello Michael,

Please find my comments below

 Is there a way for me to determine how many bytes are stored in the DMA buffer associated with socket 0 and socket 1 so that I can see if a word was missed?

>> To check the number of bytes stored in the DMA buffer, you can read buffer_p.count when the producer event is received.

For the PHY errors seen when IN and OUT transfers are done simultaneously, please refer to this KBA: Simultaneous IN/OUT USB Transfers in EZ-USB® FX3™ ... - Infineon Developer Community

Also, please let me know if CY_U3P_DMA_CB_ERROR is seen when the transfers stop. Please register for this event when configuring the DMA channel so that it is reported to the DMA callback.

Regards,
Rashi
mialc_1106291

Hi Rashi,

I occasionally see the CY_U3P_DMA_CB_PROD_EVENT event signaled with buffer_p.count equal to zero, even though, according to the host application, all of the data has gone through.

I registered notifications for CY_U3P_DMA_CB_XFER_CPLT, CY_U3P_DMA_CB_PROD_EVENT, CY_U3P_DMA_CB_ABORTED, and CY_U3P_DMA_CB_ERROR. I'm seeing CY_U3P_DMA_CB_ERROR before the host sends the cancellation request.

I was hoping to look at buffer_p.count when receiving the CY_U3P_DMA_CB_ABORTED event, since this event is signaled after the host sends a cancellation command to the device when it hasn't received any data for a long time. However, the callback function is invoked with NULL passed as the 3rd parameter. The definition of CyU3PDmaMultiCallback_t has a comment stating that the pointer is valid for the event types CY_U3P_DMA_CB_RECV_CPLT and CY_U3P_DMA_CB_PROD_EVENT, so I guess that's why NULL is passed in during an abort.
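Since the input pointer is only documented as valid for receive-complete and producer events, a callback that tallies bytes and errors has to check the event type before touching it. The sketch below uses stand-in types that only mirror the shape of the SDK's `CyU3PDmaCbType_t` / `CyU3PDmaCBInput_t` so that it compiles on its own; in real firmware you would include `cyu3dma.h` and use the SDK types and `CY_U3P_DMA_CB_*` constants instead. `dma_stats_t` and `in_channel_cb` are hypothetical names, and any buffer commit or discard your channel type requires is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the SDK types (cyu3dma.h); only the fields used here. */
typedef enum {
    CB_PROD_EVENT,  /* plays the role of CY_U3P_DMA_CB_PROD_EVENT */
    CB_ABORTED,     /* plays the role of CY_U3P_DMA_CB_ABORTED    */
    CB_ERROR        /* plays the role of CY_U3P_DMA_CB_ERROR      */
} cb_type_t;

typedef struct {
    uint8_t *buffer;
    uint16_t count;   /* bytes actually committed by the producer */
    uint16_t size;
    uint16_t status;
} dma_buffer_t;

typedef union { dma_buffer_t buffer_p; } cb_input_t;

typedef struct {
    uint32_t bytes;        /* sum of buffer_p.count over producer events */
    uint32_t prod_events;
    uint32_t zero_length;  /* producer events that carried 0 bytes       */
    uint32_t errors;       /* error notifications seen                   */
    uint32_t aborts;
} dma_stats_t;

/* Body of a DMA callback. input is dereferenced only for producer events,
   because the SDK documents it as valid only for RECV_CPLT / PROD_EVENT
   (and NULL was observed on abort). */
static void in_channel_cb(dma_stats_t *st, cb_type_t type,
                          const cb_input_t *input)
{
    switch (type) {
    case CB_PROD_EVENT:
        if (input != NULL) {
            st->prod_events++;
            st->bytes += input->buffer_p.count;
            if (input->buffer_p.count == 0)
                st->zero_length++;
        }
        break;
    case CB_ERROR:
        st->errors++;
        break;
    case CB_ABORTED:
        st->aborts++;
        break;
    }
}
```

Summing the reported counts rather than assuming full buffers makes a short or zero-length commit visible in the statistics sent back to the host.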

I understand that bidirectional transfers do not truly occur simultaneously, but KBA94607 doesn't answer my question about why there would be PHY errors when I initiate bulk OUT and IN transfers at the same time.

Thanks,
Michael

Rashi_Vatsa

Hello Michael,

Please let me know on which channel you see buffer_p.count equal to zero.

Can you also check whether there are any link errors along with the PHY errors when IN and OUT transfers are done simultaneously?

Kindly let me know which host controller is being used, and whether the transfers are synchronous or asynchronous.

Is it possible to add some delay in the host application between IN and OUT transfers and check whether the PHY and link error counts improve?

Regards,
Rashi
mialc_1106291

Hi Rashi,

I'm seeing buffer_p.count is zero on the multi channel DMA that connects the GPIF to the USB IN endpoint 4.

I don't see any link errors, just PHY errors.

This problem occurs in Windows 10 with both an ASMedia USB 3.1 eXtensible Host Controller and an Intel USB 3.0 eXtensible Host Controller. The only other hardware I have access to that runs Windows has an ASMedia 3.0 host controller, but I haven't tried the software there. I also have access to an M1-based Mac mini. I've run the software on the Mac and thus far haven't seen this problem, but the throughput is much lower, which I think is due to our using a libusb-based driver there instead of a kernel driver like we do in Windows.

The transfers are asynchronous. The code that manages the transfers knows nothing about the protocol employed on the other end (GPIF Slave FIFO, JTAG, SPI, etc.), just that it needs to transfer X bytes out and Y bytes in. The host-side thread that performs the transfers enters a loop where it initiates the transfers (asynchronously, using WriteFile and ReadFile calls with overlapped structures), each with a maximum chunk size of 32MB (this could be redefined, but has always been 32MB). The thread then waits (WaitForMultipleObjects) for a transfer to complete or for a semaphore indicating a transfer timeout (30 seconds by default). Once an OUT or IN transfer completes, another transfer is initiated as needed, again with a chunk size limit of 32MB. This same code is used by all of our USB devices, not just the FX3 devices.
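The chunking rule boils down to "issue min(remaining, 32 MB) each time a request for that direction completes". A portable sketch of just that arithmetic (the overlapped WriteFile/ReadFile plumbing is omitted; `MAX_CHUNK`, `next_chunk`, and `chunks_needed` are illustrative names):

```c
#include <stdint.h>

#define MAX_CHUNK (32u * 1024u * 1024u)  /* 32 MB per WriteFile/ReadFile */

/* Size of the next request to issue for a direction with `remaining`
   bytes still to move (0 when that direction is finished). */
static uint32_t next_chunk(uint64_t remaining)
{
    return remaining < MAX_CHUNK ? (uint32_t)remaining : MAX_CHUNK;
}

/* Number of requests needed to move `total` bytes in one direction. */
static uint32_t chunks_needed(uint64_t total)
{
    return (uint32_t)((total + MAX_CHUNK - 1) / MAX_CHUNK);
}
```

So a 160 MB test issues five 32 MB requests per direction, while a 16 MB test fits in a single request.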

It's much easier to modify the FPGA logic to insert a delay between transfers than it is to change what happens on the host side since scheduling is largely up to the host controller. My FPGA logic already contains a delay state to allow for the DMA flags to become valid. I can increase the width of the counter and change the count value that I assign before entering that state and see if it makes any difference, other than reducing throughput.

I'm not sure if you noticed, but I also tried running at 89.6 MHz and found that there are no transfer errors, just lower throughput. The same is true at 100.8 MHz when I change the endpoint burst size from 16 to 1 (burst disabled). Throughput is generally better with burst enabled and the clock at 89.6 MHz than at 100.8 MHz with burst disabled.

Is there a way to determine what causes the CY_U3P_DMA_CB_ERROR?

Thanks,
Michael

mialc_1106291

Hi Rashi,

I have a state in my FPGA logic where I wait 8 clock cycles after changing FIFOADDR to allow the DMA flags to become valid. According to the datasheet and AN65974, the FPGA logic should be able to sample the DMA flags on the 3rd rising edge after FIFOADDR is changed.

I modified my FPGA logic to wait for 5000 clock cycles whenever I change from reading an OUT buffer to writing an IN buffer, or vice versa, and after doing so I no longer receive failures or CY_U3P_DMA_CB_ERROR. I reduced the delay to 500 cycles, then 50 cycles, and found it still working. I then reduced it to 10 cycles and found that I was getting random failures again along with CY_U3P_DMA_CB_ERROR. Increasing the delay to 20 cycles seems to have eliminated the failures again.
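The workaround amounts to a minimum settle count between driving a new FIFOADDR and trusting FLAGA. The cycle-level C model below is only a behavioural sketch of that idea, not the author's HDL; `SETTLE_CYCLES`, the state names, and `fsm_step` are hypothetical, with 20 being the value that worked in the experiments described here.

```c
#include <stdbool.h>
#include <stdint.h>

#define SETTLE_CYCLES 20u  /* 8 and 10 were not enough at 100.8 MHz; 20 worked */

typedef enum { ST_SET_ADDR, ST_SETTLE, ST_WAIT_READY, ST_TRANSFER } fsm_state_t;

typedef struct {
    fsm_state_t state;
    uint8_t     fifoaddr;  /* value driven onto A[1:0]             */
    uint32_t    settle;    /* cycles left before FLAGA is sampled  */
} fsm_t;

/* Advance the model by one PCLK cycle. flaga is the current-thread DMA
   ready flag on this edge; it is ignored until the settle count expires.
   Returns true on the cycle the transfer state is entered. */
static bool fsm_step(fsm_t *f, uint8_t next_addr, bool flaga)
{
    switch (f->state) {
    case ST_SET_ADDR:
        f->fifoaddr = next_addr;
        f->settle   = SETTLE_CYCLES;
        f->state    = ST_SETTLE;
        return false;
    case ST_SETTLE:
        if (--f->settle == 0)
            f->state = ST_WAIT_READY;
        return false;      /* flag deliberately not looked at */
    case ST_WAIT_READY:
        if (flaga) {
            f->state = ST_TRANSFER;
            return true;
        }
        return false;
    case ST_TRANSFER:
    default:
        return false;      /* data phase not modelled here */
    }
}
```

Even with FLAGA held high the whole time, the model only enters the transfer state on the 22nd call: one cycle to set the address, twenty settle cycles, then the first sample.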

So what I think was happening is that the FX3 was indicating a buffer was available prior to it actually being available, and the FPGA would write the buffer and this would cause a DMA error. I don't know why the FX3 prematurely indicates the availability of an IN buffer in this scenario, but maybe it has something to do with how heavily the DMA controller is taxed.

As I mentioned previously, disabling SuperSpeed endpoint bursts also works around the problem, and reducing the clock frequency to 89.6 MHz is another option. However, both of these reduce the throughput by 10-30 MB/s. Changing the FPGA logic to wait 20 cycles before reading the DMA flags has had very little impact on throughput.
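A quick estimate is consistent with the "very little impact" observation. Assuming one 32-bit word per PCLK, a 16 KB buffer takes 4096 cycles, and in the bidirectional test the settle wait is paid once per buffer, so the idle fraction is roughly wait / (4096 + wait). This only accounts for the settle wait itself (not flag latency, PKTEND handling, or host scheduling); `settle_overhead` is an illustrative name.

```c
#include <stdint.h>

/* Fraction of bus cycles spent in the settle wait when every buffer of
   buf_bytes on a bus_bytes-wide interface is preceded by wait_cycles.
   Assumes one bus word is transferred per PCLK during the data phase. */
static double settle_overhead(uint32_t buf_bytes, uint32_t bus_bytes,
                              uint32_t wait_cycles)
{
    double data_cycles = (double)buf_bytes / bus_bytes;
    return wait_cycles / (data_cycles + wait_cycles);
}
```

With 16 KB buffers this gives about 0.5% for a 20-cycle wait, versus about 0.2% for the original 8 cycles.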

I think this issue with the DMA ready flag asserting prematurely is either a silicon bug or a bug in the FX3 SDK.

Thanks,
Michael

Rashi_Vatsa

 

Hello,

Thank you for the details.

Please let me know whether you are using dedicated DMA flags or current-thread DMA flags. Are you using the default AN65974 state machine?

Please try increasing the DMA buffer size or the DMA buffer count allotted to the DMA channels. Based on your observations, it seems that the channels do not have enough DMA buffers. Please confirm whether you are monitoring both the partial flag and the DMA ready flag before the FPGA starts a transfer.

AN65974 includes a loopback function that is similar to the test you are doing, and we haven't seen any issues with bidirectional transfers in that case.

Regards,
Rashi
mialc_1106291

Hi Rashi,

I'm using current thread DMA flags in my design and no, I'm not using the default AN65974 state machine because we don't use SLCS.

What do you mean by "DMA buffers are less for the channels"? I'm allocating 16384-byte buffers with a count of 2 for each DMA channel. I'll see if there's room to increase the size, but I don't think it's possible to address more buffers if I increase the count, because FIFOADDR is only 2 bits. If I can increase the buffer size to 32K, should I also increase my BRAM FIFO to 32K, or do you want me to transfer half of a 32K buffer into BRAM, transfer that half out to the IN DMA buffer, and then do the same with the remaining 16K?

Initially the DMA ready flag is polled (now 20 clocks after changing FIFOADDR instead of 8), and when it indicates that a buffer is ready, the FPGA advances to the first state of a download or upload operation. The partial flag is also checked so that we know whether we can transfer a full buffer or only 16 bytes.

I haven't looked at the HDL for AN65974 but will take a look to see how it's different.

Thanks,
Michael

Rashi_Vatsa

Hello Michael,

Is it possible to change the DMA buffer size to 32 KB, with a buffer count of 2 for each channel?

Please note that FIFOADDR addresses the GPIF threads, not the DMA buffers. If the DMA buffer size is increased from 16K to 32K, USB will consume the 32K of data accordingly. No change is needed in the firmware other than increasing the DMA buffer size.

As mentioned in AN65974, both flags are monitored before starting a transfer, and the DMA watermark flag is monitored to stop the data transfer.

Please let me know if you have any questions about this.

 

Regards,
Rashi
mialc_1106291

Hi Rashi,

I tried setting the buffer size to 32K and kept the buffer count set to 1 (this is a multi-channel DMA, so it automatically allocates twice as many buffers as you ask for), which resulted in 128K of RAM usage. The 32K buffers reduce how often the problem occurs when there are fewer than 20 delay clocks between setting FIFOADDR and reading the DMA flags, but the problem still occurs. Increasing the number of cycles between setting FIFOADDR and reading the flags to 20 or more works around the issue, just as it did with 16K buffers.
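The 128K figure follows from the allocation rule described here: each of the two GPIF sockets in a multi-channel DMA gets `count` buffers, so two channels need 2 × 2 × count × size bytes. A tiny helper makes the comparison with the original configuration explicit (`dma_ram_bytes` is an illustrative name):

```c
#include <stdint.h>

/* Total DMA buffer RAM for n_channels multi-channel DMAs, each with
   sockets_per_channel GPIF sockets holding `count` buffers of `size` bytes. */
static uint32_t dma_ram_bytes(uint32_t n_channels, uint32_t sockets_per_channel,
                              uint32_t count, uint32_t size)
{
    return n_channels * sockets_per_channel * count * size;
}
```

With 32K buffers this gives 131072 bytes (128K), twice the 64K used by the original two 16K buffers per direction.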

Thanks,
Michael

Rashi_Vatsa

Hello Michael,

Thank you for the details.

I tried setting the buffer size to 32K and kept the buffer count set to 1 (this is a multi-channel DMA, so it automatically allocates twice as many buffers as you ask for)

>> Please let me know whether you are using multi-channel DMA for both reading from and writing to the FPGA, i.e. 4 GPIF sockets for 2 DMA channels.

If yes, then the issue seems to be due to the host consuming data more slowly than the rate at which FX3 gets data from the FPGA.

In this case, you can try increasing the DMA buffer size or monitoring both the partial flag and the DMA ready flag before starting the data transfer.

 

Regards,
Rashi
mialc_1106291

Hi Rashi,

Yes, I'm using multi-channel DMA for both reading from and writing to the FPGA, with 4 GPIF sockets for those 2 DMA channels.

Why do I need to monitor both the partial and DMA ready flag before advancing to the initial transfer state? When I fill one of the IN buffers the GPIF (producer) commits the buffer and it should go to the USB endpoint for consumption. If the host consumes data slowly will the DMA ready flag indicate that the buffer is available even though it's still being consumed by the USB peripheral?

If this were true then I'd expect to see consistent failures when using a USB 2 cable, even with PCLK running at 90 MHz. However, no such failures occur. I don't recall seeing failures at USB 2 speed with PCLK = 100 MHz either. It's only with PCLK = 100 MHz and USB 3 that this happens, and only if I don't wait more clock cycles between setting the FIFOADDR and reading the DMA ready flag.

Thanks,
Michael

Rashi_Vatsa

Hello,

Apologies for the confusion.

The DMA ready flag should be used to start transfers. The DMA watermark flag is used to stop a transfer when the external processor or FPGA sends an arbitrary number of bytes or doesn't have any counting mechanism.

If the host consumes data slowly will the DMA ready flag indicate that the buffer is available even though it's still being consumed by the USB peripheral?

>> The DMA flag is asserted when the DMA buffer is empty (data to be written to FX3 by the FPGA) or when the DMA buffer is full (data to be read from FX3). After a FIFOADDR switch, the DMA flags indicate the status of the DMA buffer of the newly addressed socket.

As you are using a 100 MHz clock and a 32-bit GPIF interface, please confirm whether you have configured the system clock as 403.2 MHz.

Regards,
Rashi
mialc_1106291

Hi Rashi,

Ok that makes sense. I'm monitoring the DMA ready flag at the beginning of the transfer and using the watermark to know when to stop the transfer.

I am setting the system clock to 403.2 MHz.
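With a 403.2 MHz system clock, the PCLK rates quoted in this thread come out of simple dividers: 403.2 / 4 = 100.8 MHz, 403.2 / 4.5 = 89.6 MHz, and 403.2 / 6 = 67.2 MHz. The sketch assumes PCLK equals the system clock divided by an integer divider plus an optional half step (as an assumption, this corresponds to the `clkDiv`/`isHalfDiv` fields of the SDK's `CyU3PPibClock_t`; the actual PIB clock setup is not shown in the thread). It also gives the raw 32-bit bus bandwidth, which the OUT and IN directions share in the bidirectional test. `pclk_mhz` and `gpif_raw_mbps` are illustrative names.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYS_CLK_MHZ 403.2  /* FX3 system clock used in this design */

/* PCLK in MHz for an integer divider plus an optional half step. */
static double pclk_mhz(uint16_t divider, bool half)
{
    return SYS_CLK_MHZ / (divider + (half ? 0.5 : 0.0));
}

/* Raw GPIF bandwidth in MB/s (1 MB = 10^6 bytes) for a bus of bus_bytes
   width moving one word per PCLK. */
static double gpif_raw_mbps(double pclk, uint32_t bus_bytes)
{
    return pclk * bus_bytes;
}
```

For example, `pclk_mhz(4, false)` is 100.8 MHz, and a 32-bit bus at that rate moves about 403 MB/s before any per-buffer overhead.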

Thanks,
Michael

Rashi_Vatsa

Hello Michael,

Please share interface traces for 1) the normal case and 2) the case where the issue is seen.

Also, please let me know whether it is possible for you to do the following tests:

- Program your firmware onto a CYUSB3KIT-003 kit and check whether the issue is reproducible

- Port the AN65974 FPGA code to your FPGA and check whether the issue is reproducible

Regards,
Rashi
mialc_1106291

Hi Rashi,

It's possible to do both of those things, but I discussed it with my manager and we aren't sure it's worth spending the additional time and effort, since we've found that waiting 20 clock cycles between setting FIFOADDR and reading the DMA ready flag solves the problem. I don't have time to do it this week, but I will likely try porting the FPGA design back to one of our other system boards that has an FMC connector so that I can try it with the CYUSB3KIT-003.

Thanks,
Michael

Rashi_Vatsa

Hello Michael,

Thank you for the update.

Glad to hear that putting a delay before reading or writing is solving the issue for you.

Please let us know the results if you run the suggested tests.

 

Regards,
Rashi
mialc_1106291

Hi Rashi,

I ported our FPGA logic to one of our system boards that has an Artix 7 FPGA and an FMC connector. Unfortunately, I can't get it to work reliably if PCLK is faster than 67.2 MHz; above that, unidirectional transfers fail. I think this is a timing issue, since the maximum frequency at which I can successfully constrain the I/O logic is 30+ MHz lower than what I'm able to achieve on our custom board. In short, I can't replicate the same test conditions or behavior with the CYUSB3KIT-003 plugged into an FMC carrier.

I will look into porting AN65974 to our custom board which contains both the FX3 and the FPGA.

Thanks,
Michael
