
Failing to update localMessageState results in unwanted message reordering #2693

Open
chan-reclip opened this issue Jul 5, 2023 · 3 comments
Labels
🔎 Investigating: This issue is currently being investigated (Not reproducible yet) · 🌐 SDK: StreamChat (LLC): Tasks related to the StreamChat LLC SDK

Comments


chan-reclip commented Jul 5, 2023

NB: This bug report is based on my speculation; I don't have repro steps or concrete proof that what I describe below is exactly what happened.

We've recently encountered this issue again, so I had a look at the code to debug possible causes.
While browsing through the call stack, I found suspicious code in MessageRepository.swift.

If database.write (L53) fails for whatever reason, we get into the error-handling code (L58-61), which basically calls markMessageAsFailedToSend to update localMessageState to .sendingFailed.

// Change the message state to `.sending` and then proceed with the actual sending
self?.database.write({
    let messageDTO = $0.message(id: messageId)
    messageDTO?.localMessageState = .sending
}, completion: { error in
    if let error = error {
        log.error("Error changing localMessageState message with id \(messageId) to `sending`: \(error)")
        self?.markMessageAsFailedToSend(id: messageId) {
            completion(.failure(.failedToSendMessage))
        }
        return
    }

I see two issues here:

  1. The logic to update localMessageState works only when the old state is .sending. Given that we failed to update the state to .sending in the first place, it won't fire (a possible fix is sketched after this list):

     if dto?.localMessageState == .sending {
         dto?.localMessageState = .sendingFailed
     }

  2. Even if the old-state check were somehow fixed, I'm not sure the update to .sendingFailed can succeed, because it relies on the same database.write operation that just failed.
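If that reading is right, one possible mitigation for the first point (my sketch, not the SDK's actual fix) is to relax the old-state check so that a message still sitting in .pendingSend can be marked as failed. The write/session API below is copied from the snippets above; the relaxed guard is my assumption:

private func markMessageAsFailedToSend(id: MessageId, completion: @escaping () -> Void) {
    database.write({ session in
        let dto = session.message(id: id)
        // Assumption: also treat `.pendingSend` as failure-eligible, so a message
        // whose transition to `.sending` never committed is not silently retried.
        if dto?.localMessageState == .sending || dto?.localMessageState == .pendingSend {
            dto?.localMessageState = .sendingFailed
        }
    }, completion: { error in
        if let error = error {
            log.error("Error marking message \(id) as `sendingFailed`: \(error)")
        }
        completion()
    })
}

Of course, this still shares the second concern: if the database itself is in a bad state, this second write can fail just like the first one did.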

Anyway, regardless of this issue, MessageSendingQueue removes the request once it has been tried.

self?.messageRepository.sendMessage(with: request.messageId) { _ in
    self?.removeRequestAndContinue(request)
}

The above code ignores the result (_) because, I believe, it assumes sendMessage will either succeed or update localMessageState to .sendingFailed. Unfortunately, that assumption seems to be wrong.

If there are more requests in the send queue, those could hit the backend earlier than the failed request above. Because the failed message's localMessageState stays .pendingSend, it will be retried later, which eventually delays the message and thus reorders it.
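For illustration, one way the queue could defend against this is to stop ignoring the result and force the terminal state on failure. This is only a sketch; whether markMessageAsFailedToSend is (or should be) reachable from MessageSendingQueue is my assumption:

self?.messageRepository.sendMessage(with: request.messageId) { result in
    switch result {
    case .success:
        self?.removeRequestAndContinue(request)
    case .failure:
        // Assumption: explicitly push the message into `.sendingFailed` here,
        // instead of trusting sendMessage to have done it, so the message
        // cannot linger in `.pendingSend` and be resent after newer messages.
        self?.messageRepository.markMessageAsFailedToSend(id: request.messageId) {
            self?.removeRequestAndContinue(request)
        }
    }
}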

chan-reclip (Author) commented

cc @jfru

polqf (Contributor) commented Jul 5, 2023

Hi @chan-reclip,

Thank you for the report. We'll have a look into it.

A couple of comments:

  1. What evidence do you have to support that you reach MessageRepository L57 with an error? Is it logs, or is it an assumption in this case? I'm asking because, based on the comment, I'm not sure whether you actually saw it happen. I think this would be the most important thing for us to try to help you 😄
  2. If there is any evidence, could you tie it to a particular event happening in your app?

Regardless of the previous:

Even if the old-state check were somehow fixed, I'm not sure the update to .sendingFailed can succeed, because it relies on the same database.write operation that just failed

If the write fails, chances are that either the message does not exist, it has been deleted, or the DB is in a bad state. I am mostly inclined towards the last, as there is no throwing function called in that process that could justify an error being reported in L57.

regardless of this issue, MessageSendingQueue removes the request once it has been tried

Because the failed message's localMessageState stays .pendingSend, it will be retried later, which eventually delays the message and thus reorders it

Based on the assumption I stated above, it seems "safe" to assume that if the message was not retrieved in the previous step, there is no way to recover from it. That is why we ignore the error and remove the message from the queue.

In any case, we are going to have a deeper look into it whenever possible.

nuno-vieira added the 🔎 Investigating (Not reproducible yet) and 🌐 SDK: StreamChat (LLC) labels on Jul 5, 2023
chan-reclip (Author) commented

Hi @polqf,

Thanks for the prompt response.

What evidence do you have to support that you reach MessageRepository L57 with an error? Is it logs, or is it an assumption in this case? I'm asking because, based on the comment, I'm not sure whether you actually saw it happen. I think this would be the most important thing for us to try to help you 😄

Totally agreed, but unfortunately I couldn't get any logs from the original reporter. 😞
As I mentioned at the top of the description, it's purely based on my speculation. Judging from my understanding of the related code, if we ever reach L57, the issue can be reproduced.

If the write fails, chances are that either the message does not exist, it has been deleted, or the DB is in a bad state. I am mostly inclined towards the last, as there is no throwing function called in that process that could justify an error being reported in L57.

Yes, I also think the message should exist, as we check for it right before the write.

guard let dto = self?.database.backgroundReadOnlyContext.message(id: messageId) else {
    log.error("Trying to send a message with id \(messageId) but the message was deleted.")
    completion(.failure(.messageDoesNotExist))
    return
}

Based on the assumption I stated above, it seems "safe" to assume that if the message was not retrieved in the previous step, there is no way to recover from it.

It's "safe" if message was not retrieved. But haven't you said you're mostly inclined to DB being in a bad state?

In any case, we are going to have a deeper look into it whenever possible.

I'll be waiting for it.

Cheers!
