
tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan> dropping all messages if tf latency is greater than the scan message period #516

malban opened this issue Jul 18, 2022 · 14 comments


@malban
Contributor

malban commented Jul 18, 2022

Required Info:

  • Operating System:
    • Ubuntu 18.04
  • Installation type:
    • source
  • ROS Version
    • galactic
  • Version or commit hash:
  • Laser unit:
    • gazebo simulated single plane lidar

Steps to reproduce issue

  • run slam_toolbox in async mode
  • publish scan messages at 20Hz
  • broadcast the odom_frame_ to scan frame tf transform with a latency of 50+ ms (one way to inject this latency is sketched below)
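For reproduction, here is a minimal sketch of one way to inject that tf latency (a hypothetical helper node, assuming the real odometry transform is remapped onto /tf_original so this relay can re-broadcast it on /tf roughly 50ms late):

```cpp
// Hypothetical repro helper: re-broadcasts transforms from an assumed
// /tf_original topic onto /tf after a fixed artificial delay (~50ms).
#include <chrono>
#include <deque>
#include <memory>
#include <utility>

#include "rclcpp/rclcpp.hpp"
#include "tf2_msgs/msg/tf_message.hpp"

using namespace std::chrono_literals;

class DelayedTfRelay : public rclcpp::Node
{
public:
  DelayedTfRelay() : Node("delayed_tf_relay")
  {
    pub_ = create_publisher<tf2_msgs::msg::TFMessage>("/tf", 100);
    sub_ = create_subscription<tf2_msgs::msg::TFMessage>(
      "/tf_original", 100,
      [this](tf2_msgs::msg::TFMessage::SharedPtr msg) {
        // Remember the message together with its arrival time.
        queue_.emplace_back(now(), *msg);
      });
    // Periodically flush everything that has aged past the delay.
    timer_ = create_wall_timer(5ms, [this]() {
        const auto delay = rclcpp::Duration::from_seconds(0.05);
        while (!queue_.empty() && (now() - queue_.front().first) >= delay) {
          pub_->publish(queue_.front().second);
          queue_.pop_front();
        }
      });
  }

private:
  std::deque<std::pair<rclcpp::Time, tf2_msgs::msg::TFMessage>> queue_;
  rclcpp::Publisher<tf2_msgs::msg::TFMessage>::SharedPtr pub_;
  rclcpp::Subscription<tf2_msgs::msg::TFMessage>::SharedPtr sub_;
  rclcpp::TimerBase::SharedPtr timer_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<DelayedTfRelay>());
  rclcpp::shutdown();
  return 0;
}
```

Remapping the original broadcaster to /tf_original is an assumption here; any other means of delaying /tf by more than the 50ms scan period should trigger the same behavior.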

Expected behavior

scan messages are processed by slam_toolbox one at a time with a queue size of one, dropping messages as necessary

Actual behavior

all scan messages are dropped

Additional information

It seems like this is an artifact of using the tf message filter with a queue size of 1. The message filter is supposed to wait until the transform is ready before activating the callback, but with a queue size of 1 a newly arriving message evicts the one still waiting. As a result, the callback can never be triggered if the transform wait time is greater than the period of the messages.

Increasing the queue size is problematic, because then a queue of messages can build up, which is not desired for async mode.

What seems to work fine is to use a regular subscriber with a queue size of 1, and put an explicit canTransform check at the head of the callbacks. The tf timeout can then be set to a user-configurable value.
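For concreteness, a rough sketch of that workaround (illustrative only, not slam_toolbox's actual code: the "odom" target frame, the transform_timeout parameter, and the node structure are stand-ins):

```cpp
// Sketch of the proposed workaround: a plain depth-1 subscription with an
// explicit canTransform() check (and timeout) at the head of the callback.
// "odom" and transform_timeout_ are illustrative stand-ins.
#include <functional>
#include <memory>
#include <string>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"
#include "tf2_ros/buffer.h"
#include "tf2_ros/transform_listener.h"

class AsyncScanNode : public rclcpp::Node
{
public:
  AsyncScanNode()
  : Node("async_scan_node"), tf_buffer_(get_clock()), tf_listener_(tf_buffer_)
  {
    transform_timeout_ = declare_parameter("transform_timeout", 0.2);
    sub_ = create_subscription<sensor_msgs::msg::LaserScan>(
      "scan", rclcpp::SensorDataQoS().keep_last(1),
      std::bind(&AsyncScanNode::laserCallback, this, std::placeholders::_1));
  }

private:
  void laserCallback(sensor_msgs::msg::LaserScan::ConstSharedPtr scan)
  {
    std::string error;
    // Block for up to the user-configured timeout waiting for odom -> laser.
    // The TransformListener spins its own thread, so tf keeps updating
    // while we wait here.
    if (!tf_buffer_.canTransform(
        "odom", scan->header.frame_id, scan->header.stamp,
        rclcpp::Duration::from_seconds(transform_timeout_), &error))
    {
      RCLCPP_WARN(get_logger(), "Skipping scan, transform unavailable: %s",
        error.c_str());
      return;
    }
    // ... hand the scan to the normal processing path here ...
  }

  tf2_ros::Buffer tf_buffer_;
  tf2_ros::TransformListener tf_listener_;
  rclcpp::Subscription<sensor_msgs::msg::LaserScan>::SharedPtr sub_;
  double transform_timeout_{0.2};
};
```

With a keep-last depth of 1, scans that arrive while the callback is blocked waiting should simply be replaced by newer ones in the middleware queue, which is the drop-as-necessary behavior wanted for async mode.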

@SteveMacenski
Owner

How fast is your odometry being published? 20hz is exceptionally low. Usually this is between 50-100hz, but often higher.

Even if you had a queue of built-up messages to process, I don't think that guarantees this resolves your issue. The queue size helps more with not dropping relatively recent data when processing takes too long. The transform timeout for the filter is set by default to 0.5s, which should be plenty if you're saying you only have a 50ms delay: https://github.com/SteveMacenski/slam_toolbox/blob/ros2/src/slam_toolbox_common.cpp#L159. Maybe something isn't resolving there correctly?
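For reference, the message-filter path being discussed is set up roughly like this (a simplified sketch rather than the exact slam_toolbox code; the "scan" topic, "odom" target frame, and 0.5s timeout mirror the defaults mentioned above):

```cpp
// Simplified sketch of a tf2_ros::MessageFilter wired up with queue_size = 1
// and a 0.5s transform timeout (not the exact slam_toolbox code).
#include <chrono>
#include <memory>

#include "message_filters/subscriber.h"
#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"
#include "tf2_ros/buffer.h"
#include "tf2_ros/create_timer_ros.h"
#include "tf2_ros/message_filter.h"
#include "tf2_ros/transform_listener.h"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("scan_filter_demo");

  auto tf_buffer = std::make_shared<tf2_ros::Buffer>(node->get_clock());
  // The filter needs a timer interface on the buffer to honor the timeout.
  tf_buffer->setCreateTimerInterface(
    std::make_shared<tf2_ros::CreateTimerROS>(
      node->get_node_base_interface(), node->get_node_timers_interface()));
  tf2_ros::TransformListener tf_listener(*tf_buffer);

  // Raw subscription feeding the filter.
  message_filters::Subscriber<sensor_msgs::msg::LaserScan> scan_sub(
    node.get(), "scan");

  // queue_size = 1: only one scan at a time can sit waiting for its
  // transform, so a newly arriving scan evicts the one still waiting.
  tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan> scan_filter(
    scan_sub, *tf_buffer, "odom", 1, node,
    std::chrono::milliseconds(500));

  scan_filter.registerCallback(
    [&node](const sensor_msgs::msg::LaserScan::ConstSharedPtr & scan) {
      RCLCPP_INFO(node->get_logger(), "transform ready for scan at %d.%09u",
        scan->header.stamp.sec, scan->header.stamp.nanosec);
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```

With queue_size = 1 and tf consistently arriving later than the next scan, the waiting scan gets evicted before its timeout ever elapses, which matches the behavior described above.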

@malban
Contributor Author

malban commented Jul 18, 2022

The transform timeout for the filter is set by default to 0.5s which should be plenty

The 0.5s timeout is not the problem and is plenty long enough. The issue is that, with the tf message filter, the wait time is effectively capped at the period between scan messages if the message queue size is only 1. If the scan messages are coming in at 20Hz, then the timeout is effectively 50ms.

Even if you had a queue of built-up messages to process, I don't think that guarantees this resolves your issue. The queue size helps more with not dropping relatively recent data when processing takes too long.

Agreed. The queue size should remain 1.

How fast is your odometry being published? 20hz is exceptionally low. Usually this is between 50-100hz, but often higher.

The odom rate is relatively high. I experience this issue with gazebo when stressing my dev system, which results in latency in the transforms. When this latency is larger than 50ms, scan messages can't be processed in slam_toolbox.

I'll see if I can make a minimal bag file that triggers this behavior.

The fix is pretty simple and makes the node more resilient to non-optimal inputs.

@SteveMacenski
Owner

SteveMacenski commented Jul 18, 2022

If the scan messages are coming in at 20Hz, then the timeout is effectively 50ms.

I don't understand that comment; why do you believe that's true? I do not believe message filters will preempt the transform lookup for a message just because a new message came in. It should use the transform timeout duration and pass the message along once the transform resolves; other messages would simply be dropped if they arrived while it was checking for a transform.

I would try to find a solution within the use of message filters; that is going to be staying. If there's some bug in message filters, then we can resolve that over in that repo.

@malban
Contributor Author

malban commented Jul 18, 2022

why do you believe that's true

Because that is what I'm experiencing. I'll try to create a minimal bag file.

I would try to find a solution within the use of message filters

that would be ideal

@SteveMacenski
Owner

I'm heading out on PTO in 4 days for about a month, so don't make a bag file on my account; I'm not going to be able to open it until I'm back at the earliest.

I'd look at the TF message filter code directly and see if you can find the message-handling logic you think you're experiencing. That's not been my understanding of how it works, so if not, that might point to another potential issue.

@malban
Contributor Author

malban commented Jul 19, 2022

Well, I went ahead and created some bag files anyway, since they will help with debugging the message filter regardless.

The bag files contain exactly the same messages, /scan and /tf, with /tf just containing the transform between laser and odom.

The valid bag file works as expected when played back into slam_toolbox. The shifted_50ms bag file has been modified to move all of the tf messages 50ms later in the bag so they get played back slightly later. This is enough to cause the message filter to drop every scan, with output like this:

Message Filter dropping message: frame 'laser' at time 60.746 for reason 'discarding message because the queue is full'

valid.zip
shifted_50ms.zip

@SteveMacenski
Owner

SteveMacenski commented Jul 19, 2022

Why exactly is your TF delayed so much? I know that's not really the point here, but that seems illustrative of a problem. Global localization being delayed or irregular is baked into the system, but localized odometry is super important not only for SLAM Toolbox, but for pretty much every consumer of transformation information in the system. Message filters are everywhere in ROS, so if you have a problem here, you're going to run into it all over the place.

Uncharacteristically, I don't have a solution in mind for this. I suppose you could delay your laser scans, e.g. republish them on /delayed_scans; it's not like they're very important to be timely for the purposes of this node providing global localization. Even if we added a larger queue size, you're basically banking on the latency from the ROS system or processing here to virtually delay the scan's processing, which would be flaky at best. If we add canTransform to the callback, we're blocking for TF and owning logic that the message filter provides inline. This is a non-trivial problem that I can say I've never run into. I suppose 20+hz odometry with reasonably low latency has always been a baseline I've started building robots from, and when it wasn't possible, I went to the low levels of the robot to correct it rather than marching forward as "this is how it is".

@malban
Contributor Author

malban commented Jul 19, 2022

Thanks for your thoughts on this. I agree with your points about low level odometry.

If we add canTransform to the callback, we're blocking for TF and owning logic that the message filter provides inline.

You're already blocking for TF, with the intent of blocking up to 500ms by default, though that is way overkill. In practice, on a well-tuned system, as you say, there will be little, if any, blocking on TF, because it will be timely. So having the large default delay doesn't hurt anything in the normal case, but it does make slam_toolbox more robust to sub-optimal inputs.

Unfortunately, the tf message filter doesn't seem to work as expected in this case, with queue size 1.

The message filters are great, but here they only save a single, very common tf check at the top of the scan callbacks. The explicit check will produce the intended behavior and provide more useful diagnostic information than what is currently output.

I suppose 20+hz odometry with reasonably low latency has always been a baseline I've started building robots from, and when it wasn't possible, I went to the low levels of the robot to correct it rather than marching forward as "this is how it is"

I don't have control of the low-level system or its constraints. Undoubtedly, things are sub-optimal, but my understanding of the intended callback design in slam_toolbox, based on your comment, is that it should wait up to 500ms for the transform:

I do not believe message filters will preempt the transform lookup for a message just because a new message came in. It should use the transform timeout duration and pass the message along once the transform resolves; other messages would simply be dropped if they arrived while it was checking for a transform.

The bag files I uploaded should demonstrate that it's not working as intended and that the messages are likely getting preempted as new messages push them off the queue.

I'll investigate whether there is an easy fix in the tf message filter, but it's possible that this is the result of its fundamental design rather than a simple bug.

In the meantime, the MR I made resolves the issue for me, so it's not something I'm still searching for a solution to. I do think it would probably be beneficial to other users, which is why I opened the issue and MR.

@SteveMacenski
Owner

with queue size 1

Does a queue size of 2 make any such difference if you think there's an edge case not properly handled? It actually wouldn't be a big deal either way to keep a queue size of 2 if it functionally resolves the problem.

@malban
Contributor Author

malban commented Jul 20, 2022

So, it does look like it's an issue of the tf message filter preempting messages that are in queue while they are waiting for transforms.

I made a branch that prevents the message filter from preempting the oldest message that is waiting for a transform, and it seems to work as I would expect: ros2/geometry2@galactic...malban:geometry2:prevent-message-preemption-galactic

Would you mind taking a look at that? If you think it looks reasonable, I'll make an MR on geometry2.

Does a queue size of 2 make any such difference

Queue size 2 doesn't seem to work for the bag data I uploaded, but 3 does. I'm not sure why 2 doesn't work; I'll need to step through the message filter code to see why.
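For anyone skimming, the behavioral change in that branch can be illustrated with a small standalone model of the queue policy (a conceptual sketch only, not the actual geometry2 patch):

```cpp
// Conceptual model of the queue-eviction change: when the bounded queue is
// full, keep the oldest entry if it is still waiting on a transform and drop
// the incoming message instead (the stock behavior evicts the oldest entry).
#include <cstddef>
#include <deque>
#include <iostream>

struct PendingScan
{
  double stamp;          // message timestamp, in seconds
  bool waiting_for_tf;   // true while a transform request is outstanding
};

bool addScan(std::deque<PendingScan> & queue, const PendingScan & scan,
             std::size_t queue_size)
{
  if (queue.size() >= queue_size) {
    if (queue.front().waiting_for_tf) {
      // Patched policy: preserve the in-flight message so its wait can finish.
      std::cout << "dropping incoming scan at " << scan.stamp << "\n";
      return false;
    }
    queue.pop_front();  // stock policy: evict the oldest entry
  }
  queue.push_back(scan);
  return true;
}

int main()
{
  std::deque<PendingScan> queue;
  addScan(queue, {60.70, true}, 1);  // accepted, begins waiting for tf
  addScan(queue, {60.75, true}, 1);  // dropped, so the first scan keeps waiting
  std::cout << "still waiting on the scan at " << queue.front().stamp << "\n";
  return 0;
}
```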

@SteveMacenski
Owner

Queue size 2 doesn't seem to work for the bag data I uploaded, but 3 does. I'm not sure why 2 doesn't work; I'll need to step through the message filter code to see why.

I suspect that's just a function of the specific delay in your system, not a general rule.

I'll take a look!

@SteveMacenski
Owner

Yeah, that patch looks reasonable! Tag me in the PR and I'll make sure to keep track of it

@malban
Contributor Author

malban commented Jul 21, 2022

I suspect that's just a function of the specific delay in your system, not a general rule.

You're probably right; it's just a little weird that none of the messages appear to be dropped in the valid bag, but all of them are dropped in the shifted_50ms bag for both queue sizes 1 and 2.

@LeifHookedWireless

I'm seeing something very similar.
My transforms are coming over the ROS1->ROS2 bridge, so they have high latency: 35-40ms.
However, my scans are being published by an old lidar, also over the bridge, with a frequency of 7-8Hz and latency of ~140ms.
If transform_publish_period is small (<5ms), then the transforms are published steadily but the scans are rarely, if ever, processed, i.e. AsynchronousSlamToolbox::laserCallback() is simply not called 99% of the time.
Of course the incoming scans fill the buffer in the MessageFilter and the console is spammed with 'discarding message because the queue is full'.

I haven't determined the root cause, but I assume that the laserCallback() thread is being starved somehow by the handling of transform updates.

In my case, setting transform_publish_period >= 0.05 seems to fix the problem.

May not be the same issue, but I'm posting this here in case someone else has the same problem and comes looking for a solution.
