Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

capsules/alarm: left-justify 32 bit ticks, re-architect alarm, add unit tests #3975

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

lschuermann
Copy link
Member

@lschuermann lschuermann commented May 2, 2024

Pull Request Overview

This pull request is a second revision of #3973, with a full rewrite of the alarm driver. See below for why I deem this the best strategy forward.

Original Motivation

This change introduces new APIs on the Ticks trait of kernel/src/hil/time.rs which allow consumers of this type to know the underlying timer's actual ticks width (may be less than 32 bit), and to right-pad Ticks values for it to wrap at exactly (2 ** 32) - 1 ticks.

By exposing this shifted Ticks value and an appropriately scaled tick frequency to userspace, Tock finally conforms to assumptions made in libtock-c long ago:

 The client should not assume anything about the underlying clock
 used by an implementation other than that it is running at
 sufficient frequency to deliver at least millisecond granularity
 and that it is a 32-bit clock (i.e. it will wrap at 2^32 clock
 ticks).

This predictable timer interface can allow userspace to build timer infrastructure that can schedule alarms or sleep for longer than 2 ** Timer Width ticks.

Apart from now conforming to libtock-c's expectations, by scaling both timer and frequency this change should not break any existing userspace applications (apart from triggering an otherwise unrelated bug in libtock-c's virtual alarms). It does modify code for which a relevant TRD (105) exists, but that TRD is still in draft state.

Alarm Driver Rewrite

The wrapping logic in the alarm driver was likely still broken in #3973 with > 32 bit timers. This is because the alarm driver used to deal only with 32-bit values internally. However, it would not record whether those values were potentially stemming from a timer of a smaller timer. This then causes issues when it will set a new alarm in its wrapping logic.

Apart from that, the fact that it first converted to 32 bit, and then back again to the underlying timer width, up- and down-scaling as necessary, is very confusing. The code interleaves iterating over timers, computing wrapping, and controlling the hardware in a way that makes it hard to reason about, and virtually impossible to unit-test.

I refactored the driver to clearly separate out this logic:

  • Expiration now always refers to a set alarm. Option<Expiration> refers to an alarm that can be disabled. Distinguishing between these on a type level (versus one single enum Expiration for both) allows us to get rid of a many checks.
  • earliest_alarm iterates over a list of alarms, invoking a callback on each expired one, and returns the earliest non-expired one. This is very easily unit-testable and allows us to combine callback- and re-arm operations into a single handler.
  • process_rearm_or_callback is this handler. It is a thin wrapper around the above, and controls the hardware (this part is not easily testable).
  • rearm_u32_left_justified_expiration does all the magic wrapping / shifting logic. It is not clobbered with any details about the underlying alarm, but instead just converts a 32-bit left-justified ticks value into any underlying A::Ticks, regardless of width. It does appropriate wrapping for smaller and larger timers. This is extensively unit-tested.

The logic in the Driver implementation is simplified as well.

I'm not happy with the fact that this is a full re-write, but I am much more confident that the implementation is correct and has less edge-cases than with the old architecture. I hope others find the new code structure equally readable.

Testing Strategy

LiteX Sim CI and unit tests introduced in this PR:

    Finished test [optimized + debuginfo] target(s) in 0.52s
     Running unittests src/lib.rs (/home/leons/develop/tock/tock/target/debug/deps/capsules_core-d4d0469fae668189)

running 17 tests
test alarm::test::test_earliest_alarm_expired_stop ... ok
test alarm::test::test_earliest_alarm_no_alarms ... ok
test alarm::test::test_earliest_alarm_multiple_expired ... ok
test alarm::test::test_earliest_alarm_multiple_unexpired ... ok
test alarm::test::test_rearm_24bit_left_justified_noref_basic ... ok
test alarm::test::test_rearm_24bit_left_justified_noref_wrapping ... ok
test alarm::test::test_rearm_32bit_left_justified_noref_basic ... ok
test alarm::test::test_rearm_32bit_left_justified_noref_wrapping ... ok
test alarm::test::test_rearm_64bit_left_justified_noref_wrapping ... ok
test alarm::test::test_rearm_64bit_left_justified_refnowrap_dtnorwap ... ok
test alarm::test::test_rearm_64bit_left_justified_refnowrwap_dtwrap ... ok
test alarm::test::test_rearm_64bit_left_justified_refwrap_dtnowrap ... ok

TODO or Help Wanted

Updates to the TRD.

Documentation Updated

  • Updated the relevant files in /docs, or no updates are required.

Formatting

  • Ran make prepush.

@github-actions github-actions bot added the HIL This affects a Tock HIL interface. label May 2, 2024
@lschuermann lschuermann force-pushed the dev/userspace-predictable-ticks-width-2 branch 2 times, most recently from 04bad49 to c8e4da7 Compare May 2, 2024 23:34
Comment on lines 63 to 82
// TODO: document
fn earliest_alarm<UD, R, F: FnOnce(Expiration<A::Ticks>, &UD) -> Option<R>>(
now: A::Ticks,
expirations: impl Iterator<Item = (Expiration<A::Ticks>, UD, F)>,
// expired_handler: &mut impl FnMut(&Expiration<A::Ticks>, &UD) -> Option<R>,
) -> Result<Option<(Expiration<A::Ticks>, UD)>, (Expiration<A::Ticks>, UD, R)> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does much of the heavy lifting of re-arming the timer, as it's generic. This also makes it hard to understand. It is in desperate need of documentation. expired_handler as a parameter has been integrated into expirations and should be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this misses a corner case where:

now = 1000
expirations[0].fire_at = 1050
expirations[1].fire_at = 950

if the expirations[1].expired_handler runs for >50 ticks, then by the time expirations[0] is processed as the "earliest alarm" in the block at line 193, its dt will look far in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might well be wrong, but I don't think that this is a concern here.

First of all, this function takes now as a parameter. We do this primarily to avoid performing a volatile read of the underlying timer's register multiple times, but it also ensures that this function and all of the expired-callbacks are executed from a consistent reference time. This means that, regardless of how long each expired handler takes, we always identify the earliest alarm from this reference time:

// Ask the clock about a current reference once. This can incur a
// volatile read, and this may not be optimized if done in a loop:
let now = self.alarm.now();
let expired_handler = |expired: Expiration<A::Ticks>, process_id: &ProcessId| {
// This closure is run on every expired alarm, _after_ the `enter()`
// closure on the Grant iterator has returned. We are thus not
// risking reentrancy here.
// Enter the app's grant again:
let _ = self.app_alarms.enter(*process_id, |alarm_state, upcalls| {
// Reset this app's alarm:
alarm_state.expiration = None;
// Keep track of the number of armed alarms, to allow for the
// early-exit at the top of this function:
self.num_armed.set(self.num_armed.get() - 1);
// Deliver the upcall:
upcalls
.schedule_upcall(
ALARM_CALLBACK_NUM,
(
now.into_u32_left_justified() as usize,
expired.reference.wrapping_add(expired.dt).into_usize(),
0,
),
)
.ok();
});
// Proceed iteration across expirations:
None::<()>
};
// Compute the earliest alarm, and invoke the `expired_handler` for
// every expired alarm. This will issue a callback and reset the alarms
// respectively.
let res = Self::earliest_alarm(
now,
// Pass an interator of all non-None expirations:
self.app_alarms.iter().filter_map(|app| {
let process_id = app.processid();
app.enter(|alarm_state, _upcalls| {
if let Some(exp) = alarm_state.expiration {
Some((exp, process_id, expired_handler))
} else {
None
}
})
}),
);

Apart from that, we also don't just save the fire_at value, but reference and dt, where fire_at = reference.wrapping_add(dt). This provides us with significant margin of error, provided that dt is significantly smaller than Ticks::max(). Even if we were to take long, once now surpasses reference + dt and did not tick another Ticks::max() - dt, we'll still treat it as expired. We also don't risk mistaking a previous timer as the "earliest", as when now is not within [reference; dt), we treat is as expired instead.

This does raise another question though: I believe that all of this falls apart when Ticks::max() - dt is very small. In this case, we'd run a realistic risk of never firing a timer, as we'd never trigger the case where now is within [reference; dt).

Should we limit the maximum value of dt, perhaps to Ticks::max() / 2, to ensure that we always reliably detect when it has expired and have enough time to schedule the callback?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay, I think lost in my reading through the web UI diff view was where the reference, dt was coming from that is ultimately passed to set_alarm.

The concern does come back up in set_alarm implementations, as they all do do a fresh read of now. I looked through several (chosen arbitrarily as example, the rp2040 alarm implementation does this right) and all seem to handle the case where now isn't in the [ref,dt] range. Most seem to also account for minimum_dt() in the set_alarm implementation, but it looks like esp32 might be missing that nuance. Not a bug for this PR though.

bradjc
bradjc previously approved these changes May 6, 2024
Copy link
Contributor

@bradjc bradjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not the person to really dig into the alarm capsule and reason about the correctness of this change.

But, I did look over the code and it generally makes sense to me and seems more clear.

@bradjc
Copy link
Contributor

bradjc commented May 7, 2024

Can we merge just the hil changes now? Those are the same between the two PRs, right?

I say we just merge the other PR with the quick fix, and leave this open until we can get more eyes on it.

@lschuermann
Copy link
Member Author

lschuermann commented May 8, 2024

@bradjc The reason this PR exists is precisely because I'm having a really hard time reasoning whether the changes in the other PR lead to "sane" behavior (i.e. the same behavior that this PR exhibits).

It does, however, seem to resolve the issue we're facing currently, so I say let's merge it! Ideally, neither PR introduces breaking changes for userspace. And indeed, the HIL changes are the same.

@github-merge-queue github-merge-queue bot dismissed bradjc’s stale review May 9, 2024 16:27

The merge-base changed after approval.

ppannuto
ppannuto previously approved these changes May 22, 2024
Copy link
Member

@ppannuto ppannuto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the extent that I'm capable of reading and reasoning through timer code, this looks good to me.

@lschuermann lschuermann dismissed ppannuto’s stale review May 22, 2024 17:13

The merge-base changed after approval.

…d tests

Before this change, and in particular with the introduction of tock#3973,
it is very hard to reason about the wrapping logic in the alarm driver
with > 32 bit timers. This is because the alarm driver used to deal
only with 32-bit values internally. However, it would not record
whether those values were potentially stemming from a timer of a
smaller timer. This then causes issues when it will set a new alarm in
its wrapping logic.

Apart from that, the fact that it first converted to 32 bit, and then
back again to the underlying timer width, up- and down-scaling as
necessary, is very confusing. The code interleaves iterating over
timers, computing wrapping, and controlling the hardware in a way that
makes it hard to reason about, and virtually impossible to unit-test.

I refactored the driver to clearly separate out this logic:

- `Expiration` now always refers to a set alarm. `Option<Expiration>`
  refers to an alarm that can be disabled. Distinguishing between
  these on a type level (versus one single `enum Expiration` for both)
  allows us to get rid of a many checks.

- `earliest_alarm` iterates over a list of alarms, invoking a callback
  on each expired one, and returns the earliest non-expired one. This
  is very easily unit-testable and allows us to combine callback- and
  re-arm operations into a single handler.

- `process_rearm_or_callback` is this handler. It is a thin wrapper
  around the above, and controls the hardware (this part is not easily
  testable).

- `rearm_u32_left_justified_expiration` does all the magic wrapping /
  shifting logic. It is not clobbered with any details about the
  underlying alarm, but instead just converts a 32-bit left-justified
  ticks value into any underlying A::Ticks, regardless of width. It
  does appropriate wrapping for smaller and larger timers. This is
  extensively unit-tested.

The logic in the `Driver` implementation is simplified as well.
@lschuermann lschuermann force-pushed the dev/userspace-predictable-ticks-width-2 branch from c8e4da7 to 8a4f730 Compare May 28, 2024 21:11
@github-actions github-actions bot removed the HIL This affects a Tock HIL interface. label May 28, 2024
@lschuermann
Copy link
Member Author

I rebased and added comments to the magic earliest_alarm and process_rearm_or_callback functions.

I also removed the num_armed counter optimization. Previous code would count the number of grants that had an unexpired Expiration set, and update this count on the respective syscalls / alarm events. However, this breaks in the face of app faults / restarts. The capsule is not informed of such events, and thus it cannot decrement num_armed, or even check whether the previous grant state had a pending alarm for that app.

To support these optimizations and other de-initialization of peripherals, for instance to turn them into a low-power state again after an app fault, we'd want to have an additional handler for this in capsules. Historically there have been good reasons why we'd want to avoid such "broadcast" events, in particular over the dyn-traits that are our drivers. However, not supporting this does seem to limit our ability to build expressive and optimized drivers in practice.

alevy
alevy previously approved these changes May 31, 2024
Copy link
Member

@alevy alevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me

@alevy alevy added P-Significant This is a substancial change that requires review from all core developers. and removed needs-rebase labels May 31, 2024
@alevy alevy removed request for ppannuto and bradjc May 31, 2024 02:13
@lschuermann
Copy link
Member Author

@alevy Thanks for fixing the tests! Forgot to carry the removal of the num_armed to those.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P-Significant This is a substancial change that requires review from all core developers.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants