SDK-3.7.0: Memory leak detected while testing the https_client snip code


AxLi_1746341

If I wrap the wiced_https_get() call in a while (1) { } loop, the device runs out of memory within a few minutes, and after that wiced_https_get() always fails.

Any chance the Cypress AE team can take a look? I think this is an important issue.

I'm not sure whether the memory leak is in the TLS library or somewhere else.

Axel

AxLi_1746341

mwf_mmfae

Now I'm pretty sure the memory leak is in the TLS library.

The leak happens in result = ssl_handshake_client_async( &tls_context->context );

Please provide the fix ASAP.
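To confirm that a single call such as ssl_handshake_client_async() is the one leaking, one approach is to count outstanding heap bytes immediately before and after it. The sketch below does this on a host with a hypothetical counting allocator (counted_malloc/counted_free) and stand-in functions in place of the real TLS call; none of these names are WICED or BESL API. On the target, reading mallinfo().uordblks before and after the call gives a comparable before/after number.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical counting allocator (not part of WICED/BESL). Each block is
 * prefixed with a header that records its size so the free path can
 * subtract it again. The extra union members only pad the header to a
 * conservative alignment. */
typedef union {
    size_t size;
    long double align_ld;
    void *align_ptr;
    long long align_ll;
} alloc_header_t;

static size_t outstanding_bytes = 0;

void *counted_malloc(size_t size)
{
    alloc_header_t *h = malloc(sizeof(alloc_header_t) + size);
    if (h == NULL) {
        return NULL;
    }
    h->size = size;
    outstanding_bytes += size;
    return h + 1; /* user memory starts right after the header */
}

void counted_free(void *ptr)
{
    if (ptr == NULL) {
        return;
    }
    alloc_header_t *h = (alloc_header_t *)ptr - 1;
    outstanding_bytes -= h->size;
    free(h);
}

size_t counted_outstanding(void)
{
    return outstanding_bytes;
}

/* Run fn(arg) and report how many bytes it left allocated (0 = no leak). */
long heap_delta_around(void (*fn)(void *), void *arg)
{
    size_t before = outstanding_bytes;
    fn(arg);
    return (long)outstanding_bytes - (long)before;
}

/* Stand-ins for the call under test: one frees what it allocates, one does not. */
void clean_call(void *arg)
{
    (void)arg;
    counted_free(counted_malloc(72));
}

void leaky_call(void *arg)
{
    (void)arg;
    (void)counted_malloc(72); /* deliberately never freed */
}
```

With the real TLS code the counting hooks would have to be wired into whatever allocator the library uses, which this sketch does not attempt; it only shows the before/after bookkeeping.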


Hi axel.lin, have you tested this on previous SDK versions such as 3.5.2?


Hi xavier@candyhouse

No, I didn't test this on 3.5.2.

(We are currently still using SDK 3.1.2, which is much more stable, and we want to upgrade to the latest SDK if possible.)

Anonymous

Hi axel.lin

In 3.5.2 there seems to be an issue with TLS and memory as well. Since you reported a TLS connection issue on 3.7.0, I ran some tests, and here is what I get (after adding some debug prints for sbrk failures):

openssl s_client -connect 192.168.1.32:443 -status -showcerts -CAfile ca-chain.cert.pem -tls1 -debug

LOG_DEBUG: [event][116] Executing event 20002700 (801c731) for 0
LOG_DEBUG: [eci] Accepting new connection from 192.168.1.20 on 57985
Error starting TLS connection
LOG_ERROR: [eci] accept failed 1b66
LOG_DEBUG: [event][127] Executing event 20002700 (801c731) for 0
LOG_DEBUG: [eci] Accepting new connection from 192.168.1.20 on 57985
Error starting TLS connection
LOG_ERROR: [eci] accept failed 1b66
LOG_DEBUG: [event][135] Executing event 20002700 (801c731) for 0
LOG_DEBUG: [eci] Accepting new connection from 192.168.1.20 on 57985
Error starting TLS connection
LOG_ERROR: [eci] accept failed 1b66
heap increment of 4096 would overflow
Heap state:
  allocated: 52892
  free: 392
  total size: 57372
  current size: 53284
WICED/security/BESL/host/WICED/wiced_tls.c:610: assertion failure in wiced_tls_load_key: 0 != 0
Key parse error

This suggests that something is not being freed when an error occurs.
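The numbers in that heap dump are self-consistent: allocated + free = 52892 + 392 = 53284, which equals the current size, and growing the heap by another 4096 bytes would take it to 57380, 8 bytes past the 57372-byte total. A small sketch of that check, assuming an sbrk-style heap that can only grow up to a fixed total; the struct and function names are hypothetical, not the SDK's.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the values printed in the "Heap state" dump. */
typedef struct {
    size_t allocated;    /* bytes currently handed out to callers */
    size_t free_bytes;   /* bytes inside the grown heap but not in use */
    size_t total_size;   /* hard upper limit of the heap region */
    size_t current_size; /* how far the heap has grown so far */
} heap_state_t;

/* The dump is consistent if in-use plus free space equals the grown size. */
bool heap_state_consistent(const heap_state_t *h)
{
    return h->allocated + h->free_bytes == h->current_size;
}

/* An sbrk-style increment fails if it would grow past the total size. */
bool heap_increment_would_overflow(const heap_state_t *h, size_t increment)
{
    return h->current_size + increment > h->total_size;
}

/* Bytes still available for growth before hitting the limit. */
size_t heap_headroom(const heap_state_t *h)
{
    return h->total_size - h->current_size;
}
```

With the logged values only 4088 bytes of headroom are left, so any 4096-byte increment is refused even though 392 bytes are still free inside the heap.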


xavier@candyhouse wrote:

Hi axel.lin, have you tested this on previous SDK versions such as 3.5.2?

Hi xavier@candyhouse,

FYI, I just tested snip.https_client on SDK-3.5.2; it does not have the memory leak issue.

Axel


axel.lin wrote:

mwf_mmfae

Now I'm pretty sure the memory leak is in the TLS library.

The leak happens in result = ssl_handshake_client_async( &tls_context->context );

Please provide the fix ASAP.

Adding vik86,

This issue makes 3.7.0 useless for applications using https/mqtts.

I do believe you need to provide an urgent fix.

axel.lin wrote:

mwf_mmfae

Now I'm pretty sure the memory leak is in the TLS library.

The leak happens in result = ssl_handshake_client_async( &tls_context->context );

mwf_mmfae

I'm still waiting for the memory leak fix.

Can you check *when* the fix will be available?

If fixing this issue really takes a very long time, I have no choice but to switch back to the old SDK.

(But honestly, I'm surprised that a memory leak takes so long to fix.)


axel.lin - Please continue to use the CY SFDC case that was set up to track this issue. I was told by the Applications Manager that you have an active case for this topic. Note that I will continue to work on integrating the CY SFDC system into the IoT Forum so that this type of escalation can be more automated and engaging in the future, but for now it is a standalone system.

Anonymous

It would be helpful to the rest of us to have this discussion in a public forum; please let us know the timeline for this issue. axel is not the only one who uses TLS. I have been looking forward to fixes for various issues in 3.7.0, but can't upgrade while this is a problem. Thanks to axel for staying on top of it.

mwf_mmfae wrote:

axel.lin - Please continue to use the CY SFDC case that was set up to track this issue. I was told by the Applications Manager that you have an active case for this topic.

I haven't received any response on CY SFDC for this case so far (after waiting yet another 2 weeks).


I will continue to ask the Apps team to take a look at the issue.


mwf_mmfae wrote:

I will continue to ask the Apps team to take a look at the issue.

You have a simple test case in https_client; all you have to do is find the bad commit, and then the root cause should be clear.

A regression like this should be fixable within one or two days.

I don't think this is a technical issue; there must be some other reason that you haven't been able to provide a fix so far.

CaWo_1798781

axel.lin, did you have to make any modifications to run the snippet?

https_client runs out of the box on 3.5.2 and 3.6.3, but on 3.7.0 the GET fails for me.

dast_1961951

*Bump* 

I'm also seeing a memory leak when connecting/disconnecting MQTT(S).

The following heap allocations remain after disconnecting:

tag                      size   address
bignum                   312    20026328
bignum                   312    200277e8
bignum                   180    200270b8
bignum                   180    20027000
bignum                   180    20026f48
bignum                   180    20026e90
bignum                   180    200258f8
bignum                   308    20026d58
bignum                   56     200258b8
bignum                   308    20026c20
tls                      72     20025fe8
bignum                   56     20025fa8
bignum                   308    20025e70
pubkey                   212    20025d98
x509                     940    200259e8
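To put a number on that dump, the leftover blocks can be summed per tag. Using the sizes exactly as printed (and assuming they are in bytes), one connect/disconnect cycle leaves 3784 bytes behind: 2560 in bignum, 940 in x509, 212 in pubkey and 72 in tls. A sketch of that tally; the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *tag; /* allocation tag as printed in the dump */
    size_t size;     /* block size as printed (assumed to be bytes) */
} leftover_t;

/* Blocks left on the heap after an MQTT(S) disconnect, copied from the dump. */
static const leftover_t leftovers[] = {
    { "bignum", 312 }, { "bignum", 312 }, { "bignum", 180 }, { "bignum", 180 },
    { "bignum", 180 }, { "bignum", 180 }, { "bignum", 180 }, { "bignum", 308 },
    { "bignum", 56 },  { "bignum", 308 }, { "tls", 72 },     { "bignum", 56 },
    { "bignum", 308 }, { "pubkey", 212 }, { "x509", 940 },
};

/* Sum the sizes of all leftover blocks whose tag matches (NULL = all tags). */
size_t leaked_bytes(const char *tag)
{
    size_t total = 0;
    for (size_t i = 0; i < sizeof leftovers / sizeof leftovers[0]; i++) {
        if (tag == NULL || strcmp(leftovers[i].tag, tag) == 0) {
            total += leftovers[i].size;
        }
    }
    return total;
}
```

At roughly 3.7 KB per cycle, a device with only a few tens of kilobytes of heap would be expected to run out after a limited number of reconnects, which matches the behaviour described at the top of the thread.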


dstudejio wrote:

*Bump* 

I'm also seeing a memory leak when connecting/disconnecting MQTT(S).

The following heap allocations remain after disconnecting:

That's a known issue in sdk-3.7.0.

Please test with sdk-3.7.0-3:

https://community.cypress.com/thread/7611

Still not working. This is what is left over after an MQTT disconnection:

tag                      size   address
bignum                   312    200263b8
bignum                   180    20026cd0
bignum                   180    20026c18
bignum                   180    20026b60
bignum                   180    20026aa8
bignum                   180    20025510
bignum                   308    20026970
bignum                   56     200254d0
bignum                   308    20026838
tls                      72     20025c00
bignum                   56     20025bc0
bignum                   308    20025a88
pubkey                   212    200259b0
x509                     940    20025600
queue                    692    20019ac0
<extent of heap prior to mqtt connection>
stack                    1052   20023028
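Comparing this dump with the earlier SDK-3.7.0 one, restricted to the tls, bignum, pubkey and x509 entries that appear in both (queue and stack are left out because their relation to the connection is not clear from the dump), the SDK-3.7.0-3 run keeps 14 of the 15 blocks: the only difference is a single 312-byte bignum, so 3472 of the 3784 leftover bytes remain. A sketch of that comparison as a multiset difference on (tag, size) pairs; the names are hypothetical, and addresses are ignored because they differ between runs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *tag;
    size_t size;
} block_t;

/* TLS-related blocks left after disconnect on SDK-3.7.0 (earlier dump). */
static const block_t dump_370[] = {
    { "bignum", 312 }, { "bignum", 312 }, { "bignum", 180 }, { "bignum", 180 },
    { "bignum", 180 }, { "bignum", 180 }, { "bignum", 180 }, { "bignum", 308 },
    { "bignum", 56 },  { "bignum", 308 }, { "tls", 72 },     { "bignum", 56 },
    { "bignum", 308 }, { "pubkey", 212 }, { "x509", 940 },
};

/* Same tags left after disconnect on SDK-3.7.0-3 (queue/stack omitted). */
static const block_t dump_3703[] = {
    { "bignum", 312 }, { "bignum", 180 }, { "bignum", 180 }, { "bignum", 180 },
    { "bignum", 180 }, { "bignum", 180 }, { "bignum", 308 }, { "bignum", 56 },
    { "bignum", 308 }, { "tls", 72 },     { "bignum", 56 },  { "bignum", 308 },
    { "pubkey", 212 }, { "x509", 940 },
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))
#define MAX_BLOCKS 64 /* nb is assumed not to exceed this */

/* Bytes in blocks of `a` that have no matching (tag, size) block in `b`.
 * Each block of `b` can be matched at most once, so duplicates count. */
size_t bytes_only_in(const block_t *a, size_t na, const block_t *b, size_t nb)
{
    bool used[MAX_BLOCKS] = { false };
    size_t total = 0;
    for (size_t i = 0; i < na; i++) {
        bool matched = false;
        for (size_t j = 0; j < nb && j < MAX_BLOCKS; j++) {
            if (!used[j] && b[j].size == a[i].size && strcmp(b[j].tag, a[i].tag) == 0) {
                used[j] = true;
                matched = true;
                break;
            }
        }
        if (!matched) {
            total += a[i].size;
        }
    }
    return total;
}
```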


Just to clarify whether it's the TLS memory leak or some other issue:

Can you please test snip.https_client with the modification below?

It simply wraps the HTTPS request in a while(1) loop so it is sent repeatedly, and also prints the mallinfo() statistics on each iteration.

In apps/snip/https_client/https_client.c:

while ( 1 )
{
    struct mallinfo mi;

    result = wiced_https_get( &ip_address, SIMPLE_GET_REQUEST, buffer, BUFFER_LENGTH, NULL );
    if ( result == WICED_SUCCESS )
    {
        WPRINT_APP_INFO( ( "Server returned\n%s", buffer ) );
    }
    else
    {
        WPRINT_APP_INFO( ( "Get failed: %u\n", result ) );
    }

    /* Print heap usage after each request; uordblks should not keep growing if nothing leaks */
    mi = mallinfo( );
    printf( "arena:%5d ordblks:%5d smblks:%5d hblks:%5d hblkhd:%5d usmblks:%5d fsmblks:%5d uordblks:%d fordblks:%d keepcost:%d\r\n",
            mi.arena, mi.ordblks, mi.smblks, mi.hblks, mi.hblkhd,
            mi.usmblks, mi.fsmblks, mi.uordblks, mi.fordblks, mi.keepcost );
}

Run the code for 10 minutes; if there is no memory leak, it should keep working the whole time.
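A rough way to turn "run it for 10 minutes" into a pass/fail check is to collect the uordblks value printed on each iteration and fit a straight line through the samples: a slope near zero means the allocated heap is stable, while a slope of hundreds or thousands of bytes per request points to a leak. The function names and the threshold are hypothetical choices for this sketch; the samples would be copied from the device log.

```c
#include <stdbool.h>
#include <stddef.h>

/* Least-squares slope of samples[i] against the iteration index i, i.e. the
 * average change in allocated bytes per request. Returns 0.0 for n < 2. */
double heap_growth_per_iteration(const long *samples, size_t n)
{
    if (n < 2) {
        return 0.0;
    }
    double mean_x = (double)(n - 1) / 2.0;
    double mean_y = 0.0;
    for (size_t i = 0; i < n; i++) {
        mean_y += (double)samples[i];
    }
    mean_y /= (double)n;

    double num = 0.0;
    double den = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = (double)i - mean_x;
        num += dx * ((double)samples[i] - mean_y);
        den += dx * dx;
    }
    return num / den;
}

/* Flag a leak when allocated memory grows faster than threshold_bytes per
 * iteration; the threshold is a judgment call, not an SDK value. */
bool looks_like_leak(const long *samples, size_t n, double threshold_bytes)
{
    return heap_growth_per_iteration(samples, n) > threshold_bytes;
}
```

A fitted slope is less sensitive than comparing only the first and last sample to one-off fluctuations, such as a request that happens to fail while buffers are still allocated.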


The code above doesn't compile correctly. Any suggestions?

0 Likes

dstudejio wrote:

The code above doesn't compile correctly. Any suggestions?

Adding the include below should fix it:

#include <malloc.h>


I just tested snip.https_client with SDK-3.7.0-3.

I'm surprised that I still see the memory leak.

My test log is attached.

mwf_mmfae, what's your take?


I will see if I can get someone on the engineering team to look into this issue.

Anonymous

Hi mwf_mmfae,

Even in SDK 3.5.2, I am facing an issue with reconnecting to MQTT. When I connect and disconnect MQTT repeatedly in a while(1) loop, the system hangs after 5-6 iterations. Can you suggest a workaround? I have been stuck on this issue for the last week.

while ( 1 )
{
    do
    {
        ret = aws_mqtt_conn_open( app_info.mqtt_object, mqtt_connection_event_cb );
        connection_retries++;
    } while ( ( ret != WICED_SUCCESS ) && ( connection_retries < WICED_MQTT_CONNECTION_NUMBER_OF_RETRIES ) );

    do
    {
        ret = aws_mqtt_app_subscribe( app_info.mqtt_object, app_info.shadow_delta_topic, WICED_MQTT_QOS_DELIVER_AT_MOST_ONCE );
        connection_retries++;
    } while ( ( ret != WICED_SUCCESS ) && ( connection_retries < WICED_MQTT_CONNECTION_NUMBER_OF_RETRIES ) );

    shadow_close();
    wiced_rtos_delay_milliseconds( 1000 );
}

In my shadow_close() I am just calling:

    mqtt_network_deinit(&(((mqtt_connection_t*)app_info.mqtt_object)->socket));
    mqtt_connection_deinit((mqtt_connection_t*) app_info.mqtt_object);

Am I making a mistake somewhere? Please take a look and give me some suggestions.
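Separately from any SDK issue, one thing worth checking in the loop as posted: if connection_retries is initialised only once before while(1), it is never reset, and the subscribe loop shares the same counter. After a few passes the counter is already at WICED_MQTT_CONNECTION_NUMBER_OF_RETRIES, so each do { } while runs exactly once, and the subscribe is attempted even when the connect failed. Below is a sketch of a hypothetical helper (retry_with_limit, not part of the WICED SDK) that gives every operation its own fresh retry budget; fake_op is only a stand-in for testing.

```c
#include <stdbool.h>
#include <stddef.h>

/* One attempt of some operation; returns true on success. */
typedef bool (*attempt_fn)(void *ctx);

/* Hypothetical helper: call op(ctx) up to max_attempts times, stopping at
 * the first success. The counter is local, so every call starts fresh. */
bool retry_with_limit(attempt_fn op, void *ctx, int max_attempts, int *attempts_used)
{
    int attempts = 0;
    bool ok = false;
    while (!ok && attempts < max_attempts) {
        ok = op(ctx);
        attempts++;
    }
    if (attempts_used != NULL) {
        *attempts_used = attempts;
    }
    return ok;
}

/* Stand-in operation for testing: succeeds on the succeed_on-th call. */
typedef struct {
    int calls;
    int succeed_on;
} fake_op_t;

bool fake_op(void *ctx)
{
    fake_op_t *f = ctx;
    f->calls++;
    return f->calls >= f->succeed_on;
}
```

In the posted loop, the connect step would be wrapped in an attempt function returning aws_mqtt_conn_open( ... ) == WICED_SUCCESS, and the subscribe step would only run when that call to retry_with_limit returned true. Whether this is related to the hang is not established here; it only removes one source of confusing behaviour.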


I'm also seeing "uordblks" keep increasing, just like axel.lin.

mwf_mmfae, this is a pretty significant issue. Previously I had to reset the MCU after disconnecting to deal with it.

I'm going to write a heap cleanup routine to work around it in the short term.


Unfortunately, the heap cleanup routine depended on malloc_debug, which appears to make the BT libraries exceed their stack. So for now I'm stuck with a memory leak.


dstudejio wrote:

Unfortunately, the heap cleanup routine depended on malloc_debug, which appears to make the BT libraries exceed their stack. So for now I'm stuck with a memory leak.

Don't spend time working around this issue. It won't work, and such code cannot be used in a real product.

It simply needs to be fixed.


Agreed, it needs fixing, and quickly, but I can't halt development for this.

I've found that the BT stack overflow continues even without malloc_debug. Some newly introduced BT stack usage on top of our own is causing the overflow in this case.


I've fully worked around it by using malloc_debug and freeing any TLS-related heap elements.

mwf_mmfae, any updates on the timeline/plan for this issue?
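The workaround mentioned above (tracking TLS-related heap blocks with malloc_debug and freeing them after disconnect) can be sketched with a small tagged-allocation table. malloc_debug's actual interface is not shown in this thread, so the sketch uses hypothetical names (tagged_malloc, tagged_free, tagged_free_all) on top of plain malloc/free. It only illustrates bulk-freeing by tag, which is safe only if nothing can still reference the freed blocks; that is why it remains a stopgap rather than a fix.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TRACKED 128

typedef struct {
    void *ptr;       /* NULL means the slot is unused */
    size_t size;
    const char *tag; /* e.g. "bignum", "x509"; expected to be a string literal */
} tracked_block_t;

static tracked_block_t tracked[MAX_TRACKED];

void *tagged_malloc(const char *tag, size_t size)
{
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].ptr == NULL) {
            void *p = malloc(size);
            if (p == NULL) {
                return NULL;
            }
            tracked[i] = (tracked_block_t){ p, size, tag };
            return p;
        }
    }
    return NULL; /* table full: treat as an allocation failure */
}

/* Free one block; pointers that are not tracked (or already freed) are ignored. */
void tagged_free(void *ptr)
{
    if (ptr == NULL) {
        return;
    }
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].ptr == ptr) {
            free(ptr);
            tracked[i].ptr = NULL;
            return;
        }
    }
}

/* Bytes currently outstanding for one tag. */
size_t tagged_outstanding(const char *tag)
{
    size_t total = 0;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].ptr != NULL && strcmp(tracked[i].tag, tag) == 0) {
            total += tracked[i].size;
        }
    }
    return total;
}

/* Free every outstanding block carrying `tag`; returns the bytes released. */
size_t tagged_free_all(const char *tag)
{
    size_t released = 0;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].ptr != NULL && strcmp(tracked[i].tag, tag) == 0) {
            released += tracked[i].size;
            free(tracked[i].ptr);
            tracked[i].ptr = NULL;
        }
    }
    return released;
}
```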


Nothing yet. Sorry.


mwf_mmfae wrote:

Nothing yet. Sorry.

Any update?

Anonymous

Hi axel,

Even in SDK 3.5.2, I am facing an issue with reconnecting to MQTT. When I connect and disconnect MQTT repeatedly in a while(1) loop, the system hangs after 5-6 iterations. Can you suggest a workaround? I have been stuck on this issue for the last week.

while ( 1 )
{
    do
    {
        ret = aws_mqtt_conn_open( app_info.mqtt_object, mqtt_connection_event_cb );
        connection_retries++;
    } while ( ( ret != WICED_SUCCESS ) && ( connection_retries < WICED_MQTT_CONNECTION_NUMBER_OF_RETRIES ) );

    do
    {
        ret = aws_mqtt_app_subscribe( app_info.mqtt_object, app_info.shadow_delta_topic, WICED_MQTT_QOS_DELIVER_AT_MOST_ONCE );
        connection_retries++;
    } while ( ( ret != WICED_SUCCESS ) && ( connection_retries < WICED_MQTT_CONNECTION_NUMBER_OF_RETRIES ) );

    shadow_close();
    wiced_rtos_delay_milliseconds( 1000 );
}

In my shadow_close() I am just calling:

    mqtt_network_deinit(&(((mqtt_connection_t*)app_info.mqtt_object)->socket));
    mqtt_connection_deinit((mqtt_connection_t*) app_info.mqtt_object);

Am I making a mistake somewhere? Please take a look and give me some suggestions.


anandram wrote:

Hi axel,

Even in SDK 3.5.2, I am facing an issue with reconnecting to MQTT. When I connect and disconnect MQTT repeatedly in a while(1) loop, the system hangs after 5-6 iterations. Can you suggest a workaround? I have been stuck on this issue for the last week.

I don't work for Cypress.

Please contact the Cypress team.

mwf_mmfae


Have you tested using WICED Studio 4?

Anonymous

No, we are still using 3.5.2.

At this point it's very difficult for us to move to any other SDK version.

If you know the real problem, can you give me a patch set that makes multiple reconnections to MQTT possible? It's been a week and I still can't figure out what's going wrong. The only hint I have is that it hangs after executing ssl_handshake_async(), whose source code we don't have either.

So I really need a serious amount of help from your side.

Thanks in advance


I spoke to the development team, and we should be able to release SDK 3.7.0-7 next week, before Thanksgiving. This rev will have all of the patches that were applied in WICED Studio 4. Moving to the latest rev, or at least testing it to confirm that it fixes the issue, would probably be your best bet, as I am not sure how soon the developers will be able to look into your specific issue on SDK 3.5.2.


mwf_mmfae wrote:

I spoke to the development team, and we should be able to release SDK 3.7.0-7 next week, before Thanksgiving. This rev will have all of the patches that were applied in WICED Studio 4. Moving to the latest rev, or at least testing it to confirm that it fixes the issue, would probably be your best bet, as I am not sure how soon the developers will be able to look into your specific issue on SDK 3.5.2.

mwf_mmfae

No. Continually asking people to wait for an unreleased SDK is wrong.

The best option is for your team to first confirm whether anything is wrong with anandram's code (it only takes a few minutes).

If there is nothing wrong with anandram's code, your team needs to identify what is wrong in the SDK version the user reported.

You cannot keep asking people to test SDK versions they don't use in their projects.

On the SDK version people are using, they may already have code that works around known issues.

Changing the SDK version can break those workarounds, because a new SDK may have behavior changes that turn one bug into another or introduce new bugs; i.e. a new SDK may introduce more *unknown* issues.

It's software. It's pretty common for people to find issues *after* a release.

What if the new release still does not fix the reported issue? Ask people to wait for yet another release? Working this way just does not work.

You need to deliver the exact fix for the issues users report, rather than giving people a huge update release.

If you provide the fix for user-reported issues, whether as source code/diff or as a binary library, developers can verify the fix and apply the change to the SDK they use themselves. (This will save you a lot of time addressing issues on different SDK versions.)

This way, people do not need to *wait* for a new release, and each SDK version becomes more stable.

I think I have already told you this multiple times; you are not listening.

Anonymous

Hi mwf_mmfae,

I even tried WICED SDK 4.0. I don't see many changes in the MQTT and BESL libraries between SDK 3.5.2 and 4.0, unless the core changes are in the static library you provide in the BESL folder. The same issue still persists. Can you give me some insight on this ASAP?


anandram wrote:

Hi mwf_mmfae,

I even tried WICED SDK 4.0. I don't see many changes in the MQTT and BESL libraries between SDK 3.5.2 and 4.0, unless the core changes are in the static library you provide in the BESL folder. The same issue still persists. Can you give me some insight on this ASAP?

Hi anandram,

It's not clear to me which issue you mean. Do you mean the memory leak issue or something else?

Anonymous

Hi Axel,

As I mentioned in an earlier post, I haven't been able to narrow down the case. When debugging, I always see the device hang after executing the ssl_handshake_server_async() function, but we don't have its source code. I don't think it's a memory issue: I ran https_client in a while(1) loop for about 20 minutes and it ran fine.


anandram wrote:

Hi Axel,

As I mentioned in an earlier post, I haven't been able to narrow down the case. When debugging, I always see the device hang after executing the ssl_handshake_server_async() function, but we don't have its source code. I don't think it's a memory issue: I ran https_client in a while(1) loop for about 20 minutes and it ran fine.

OK, maybe you should create a separate discussion thread for that issue, with clear reproduction steps and the SDK version you tested.

It's a bit misleading here, since the issue is different from the subject of this discussion thread.