capsules/alarm: left-justify 32 bit ticks, re-architect alarm, add unit tests #3975

lschuermann · 2024-05-02T23:23:28Z

Pull Request Overview

This pull request is a second revision of #3973, with a full rewrite of the alarm driver. See below for why I deem this the best strategy forward.

Original Motivation

This change introduces new APIs on the Ticks trait of kernel/src/hil/time.rs which allow consumers of this type to know the underlying timer's actual ticks width (may be less than 32 bit), and to right-pad Ticks values for it to wrap at exactly (2 ** 32) - 1 ticks.

By exposing this shifted Ticks value and an appropriately scaled tick frequency to userspace, Tock finally conforms to assumptions made in libtock-c long ago:

 The client should not assume anything about the underlying clock
 used by an implementation other than that it is running at
 sufficient frequency to deliver at least millisecond granularity
 and that it is a 32-bit clock (i.e. it will wrap at 2^32 clock
 ticks).

This predictable timer interface can allow userspace to build timer infrastructure that can schedule alarms or sleep for longer than 2 ** Timer Width ticks.

Apart from now conforming to libtock-c's expectations, by scaling both timer and frequency this change should not break any existing userspace applications (apart from triggering an otherwise unrelated bug in libtock-c's virtual alarms). It does modify code for which a relevant TRD (105) exists, but that TRD is still in draft state.

Alarm Driver Rewrite

The wrapping logic in the alarm driver was likely still broken in #3973 with > 32 bit timers. This is because the alarm driver used to deal only with 32-bit values internally. However, it would not record whether those values were potentially stemming from a timer of a smaller timer. This then causes issues when it will set a new alarm in its wrapping logic.

Apart from that, the fact that it first converted to 32 bit, and then back again to the underlying timer width, up- and down-scaling as necessary, is very confusing. The code interleaves iterating over timers, computing wrapping, and controlling the hardware in a way that makes it hard to reason about, and virtually impossible to unit-test.

I refactored the driver to clearly separate out this logic:

Expiration now always refers to a set alarm. Option<Expiration> refers to an alarm that can be disabled. Distinguishing between these on a type level (versus one single enum Expiration for both) allows us to get rid of a many checks.
earliest_alarm iterates over a list of alarms, invoking a callback on each expired one, and returns the earliest non-expired one. This is very easily unit-testable and allows us to combine callback- and re-arm operations into a single handler.
process_rearm_or_callback is this handler. It is a thin wrapper around the above, and controls the hardware (this part is not easily testable).
rearm_u32_left_justified_expiration does all the magic wrapping / shifting logic. It is not clobbered with any details about the underlying alarm, but instead just converts a 32-bit left-justified ticks value into any underlying A::Ticks, regardless of width. It does appropriate wrapping for smaller and larger timers. This is extensively unit-tested.

The logic in the Driver implementation is simplified as well.

I'm not happy with the fact that this is a full re-write, but I am much more confident that the implementation is correct and has less edge-cases than with the old architecture. I hope others find the new code structure equally readable.

Testing Strategy

LiteX Sim CI and unit tests introduced in this PR:

    Finished test [optimized + debuginfo] target(s) in 0.52s
     Running unittests src/lib.rs (/home/leons/develop/tock/tock/target/debug/deps/capsules_core-d4d0469fae668189)

running 17 tests
test alarm::test::test_earliest_alarm_expired_stop ... ok
test alarm::test::test_earliest_alarm_no_alarms ... ok
test alarm::test::test_earliest_alarm_multiple_expired ... ok
test alarm::test::test_earliest_alarm_multiple_unexpired ... ok
test alarm::test::test_rearm_24bit_left_justified_noref_basic ... ok
test alarm::test::test_rearm_24bit_left_justified_noref_wrapping ... ok
test alarm::test::test_rearm_32bit_left_justified_noref_basic ... ok
test alarm::test::test_rearm_32bit_left_justified_noref_wrapping ... ok
test alarm::test::test_rearm_64bit_left_justified_noref_wrapping ... ok
test alarm::test::test_rearm_64bit_left_justified_refnowrap_dtnorwap ... ok
test alarm::test::test_rearm_64bit_left_justified_refnowrwap_dtwrap ... ok
test alarm::test::test_rearm_64bit_left_justified_refwrap_dtnowrap ... ok

TODO or Help Wanted

Updates to the TRD.

Documentation Updated

Updated the relevant files in /docs, or no updates are required.

Formatting

Ran make prepush.

lschuermann · 2024-05-02T23:53:29Z

capsules/core/src/alarm.rs

+    // TODO: document
+    fn earliest_alarm<UD, R, F: FnOnce(Expiration<A::Ticks>, &UD) -> Option<R>>(
+        now: A::Ticks,
+        expirations: impl Iterator<Item = (Expiration<A::Ticks>, UD, F)>,
+        // expired_handler: &mut impl FnMut(&Expiration<A::Ticks>, &UD) -> Option<R>,
+    ) -> Result<Option<(Expiration<A::Ticks>, UD)>, (Expiration<A::Ticks>, UD, R)> {


This function does much of the heavy lifting of re-arming the timer, as it's generic. This also makes it hard to understand. It is in desperate need of documentation. expired_handler as a parameter has been integrated into expirations and should be removed.

I believe this misses a corner case where:

now = 1000 expirations[0].fire_at = 1050 expirations[1].fire_at = 950

if the expirations[1].expired_handler runs for >50 ticks, then by the time expirations[0] is processed as the "earliest alarm" in the block at line 193, its dt will look far in the future.

I might well be wrong, but I don't think that this is a concern here.

First of all, this function takes now as a parameter. We do this primarily to avoid performing a volatile read of the underlying timer's register multiple times, but it also ensures that this function and all of the expired-callbacks are executed from a consistent reference time. This means that, regardless of how long each expired handler takes, we always identify the earliest alarm from this reference time:

tock/capsules/core/src/alarm.rs

Lines 139 to 190 in c8e4da7

// Ask the clock about a current reference once. This can incur a

// volatile read, and this may not be optimized if done in a loop:

let now = self.alarm.now();

let expired_handler = |expired: Expiration<A::Ticks>, process_id: &ProcessId| {

// This closure is run on every expired alarm, _after_ the `enter()`

// closure on the Grant iterator has returned. We are thus not

// risking reentrancy here.

// Enter the app's grant again:

let _ = self.app_alarms.enter(*process_id, |alarm_state, upcalls| {

// Reset this app's alarm:

alarm_state.expiration = None;

// Keep track of the number of armed alarms, to allow for the

// early-exit at the top of this function:

self.num_armed.set(self.num_armed.get() - 1);

// Deliver the upcall:

upcalls

.schedule_upcall(

ALARM_CALLBACK_NUM,

(

now.into_u32_left_justified() as usize,

expired.reference.wrapping_add(expired.dt).into_usize(),

0,

),

)

.ok();

});

// Proceed iteration across expirations:

None::<()>

};

// Compute the earliest alarm, and invoke the `expired_handler` for

// every expired alarm. This will issue a callback and reset the alarms

// respectively.

let res = Self::earliest_alarm(

now,

// Pass an interator of all non-None expirations:

self.app_alarms.iter().filter_map(|app| {

let process_id = app.processid();

app.enter(|alarm_state, _upcalls| {

if let Some(exp) = alarm_state.expiration {

Some((exp, process_id, expired_handler))

} else {

None

}

})

}),

);

Apart from that, we also don't just save the fire_at value, but reference and dt, where fire_at = reference.wrapping_add(dt). This provides us with significant margin of error, provided that dt is significantly smaller than Ticks::max(). Even if we were to take long, once now surpasses reference + dt and did not tick another Ticks::max() - dt, we'll still treat it as expired. We also don't risk mistaking a previous timer as the "earliest", as when now is not within [reference; dt), we treat is as expired instead.

This does raise another question though: I believe that all of this falls apart when Ticks::max() - dt is very small. In this case, we'd run a realistic risk of never firing a timer, as we'd never trigger the case where now is within [reference; dt).

Should we limit the maximum value of dt, perhaps to Ticks::max() / 2, to ensure that we always reliably detect when it has expired and have enough time to schedule the callback?

Ah, okay, I think lost in my reading through the web UI diff view was where the reference, dt was coming from that is ultimately passed to set_alarm.

The concern does come back up in set_alarm implementations, as they all do do a fresh read of now. I looked through several (chosen arbitrarily as example, the rp2040 alarm implementation does this right) and all seem to handle the case where now isn't in the [ref,dt] range. Most seem to also account for minimum_dt() in the set_alarm implementation, but it looks like esp32 might be missing that nuance. Not a bug for this PR though.

bradjc

I'm not the person to really dig into the alarm capsule and reason about the correctness of this change.

But, I did look over the code and it generally makes sense to me and seems more clear.

bradjc · 2024-05-07T18:33:16Z

Can we merge just the hil changes now? Those are the same between the two PRs, right?

I say we just merge the other PR with the quick fix, and leave this open until we can get more eyes on it.

lschuermann · 2024-05-08T13:41:54Z

@bradjc The reason this PR exists is precisely because I'm having a really hard time reasoning whether the changes in the other PR lead to "sane" behavior (i.e. the same behavior that this PR exhibits).

It does, however, seem to resolve the issue we're facing currently, so I say let's merge it! Ideally, neither PR introduces breaking changes for userspace. And indeed, the HIL changes are the same.

The merge-base changed after approval.

ppannuto

To the extent that I'm capable of reading and reasoning through timer code, this looks good to me.

The merge-base changed after approval.

…d tests Before this change, and in particular with the introduction of tock#3973, it is very hard to reason about the wrapping logic in the alarm driver with > 32 bit timers. This is because the alarm driver used to deal only with 32-bit values internally. However, it would not record whether those values were potentially stemming from a timer of a smaller timer. This then causes issues when it will set a new alarm in its wrapping logic. Apart from that, the fact that it first converted to 32 bit, and then back again to the underlying timer width, up- and down-scaling as necessary, is very confusing. The code interleaves iterating over timers, computing wrapping, and controlling the hardware in a way that makes it hard to reason about, and virtually impossible to unit-test. I refactored the driver to clearly separate out this logic: - `Expiration` now always refers to a set alarm. `Option<Expiration>` refers to an alarm that can be disabled. Distinguishing between these on a type level (versus one single `enum Expiration` for both) allows us to get rid of a many checks. - `earliest_alarm` iterates over a list of alarms, invoking a callback on each expired one, and returns the earliest non-expired one. This is very easily unit-testable and allows us to combine callback- and re-arm operations into a single handler. - `process_rearm_or_callback` is this handler. It is a thin wrapper around the above, and controls the hardware (this part is not easily testable). - `rearm_u32_left_justified_expiration` does all the magic wrapping / shifting logic. It is not clobbered with any details about the underlying alarm, but instead just converts a 32-bit left-justified ticks value into any underlying A::Ticks, regardless of width. It does appropriate wrapping for smaller and larger timers. This is extensively unit-tested. The logic in the `Driver` implementation is simplified as well.

lschuermann · 2024-05-28T21:16:04Z

I rebased and added comments to the magic earliest_alarm and process_rearm_or_callback functions.

I also removed the num_armed counter optimization. Previous code would count the number of grants that had an unexpired Expiration set, and update this count on the respective syscalls / alarm events. However, this breaks in the face of app faults / restarts. The capsule is not informed of such events, and thus it cannot decrement num_armed, or even check whether the previous grant state had a pending alarm for that app.

To support these optimizations and other de-initialization of peripherals, for instance to turn them into a low-power state again after an app fault, we'd want to have an additional handler for this in capsules. Historically there have been good reasons why we'd want to avoid such "broadcast" events, in particular over the dyn-traits that are our drivers. However, not supporting this does seem to limit our ability to build expressive and optimized drivers in practice.

alevy

This looks good to me

lschuermann · 2024-05-31T15:08:35Z

@alevy Thanks for fixing the tests! Forgot to carry the removal of the num_armed to those.

github-actions bot added the HIL This affects a Tock HIL interface. label May 2, 2024

lschuermann mentioned this pull request May 2, 2024

kernel/hil/time,capsules/alarm: pad Ticks to 32 bit for predictable wrapping #3973

Merged

2 tasks

lschuermann force-pushed the dev/userspace-predictable-ticks-width-2 branch 2 times, most recently from 04bad49 to c8e4da7 Compare May 2, 2024 23:34

lschuermann commented May 2, 2024

View reviewed changes

github-actions bot assigned alexandruradovici May 3, 2024

This was referenced May 6, 2024

alarm: fix gettimeasticks arithmetic imprecision tock/libtock-c#411

Closed

timer: handle timer overflow tock/libtock-c#395

Open

bradjc previously approved these changes May 6, 2024

View reviewed changes

bradjc added the needs-rebase label May 14, 2024

ppannuto previously approved these changes May 22, 2024

View reviewed changes

bradjc added needs-rebase and removed needs-rebase labels May 28, 2024

lschuermann force-pushed the dev/userspace-predictable-ticks-width-2 branch from c8e4da7 to 8a4f730 Compare May 28, 2024 21:11

github-actions bot removed the HIL This affects a Tock HIL interface. label May 28, 2024

lschuermann requested review from ppannuto, bradjc and alevy May 28, 2024 21:16

lschuermann mentioned this pull request May 28, 2024

timer: Add support for 64-bit timers #3343

Open

2 tasks

alevy previously approved these changes May 31, 2024

View reviewed changes

alevy added P-Significant This is a substancial change that requires review from all core developers. and removed needs-rebase labels May 31, 2024

alevy removed request for ppannuto and bradjc May 31, 2024 02:13

Fix alarm test compilation issue

5480386

alevy dismissed their stale review via 5480386 May 31, 2024 02:19

alevy approved these changes May 31, 2024

View reviewed changes

lschuermann mentioned this pull request May 31, 2024

syscall: add notification mechanism #3252

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

capsules/alarm: left-justify 32 bit ticks, re-architect alarm, add unit tests #3975

capsules/alarm: left-justify 32 bit ticks, re-architect alarm, add unit tests #3975

lschuermann commented May 2, 2024 •

edited

lschuermann May 2, 2024

ppannuto May 7, 2024

lschuermann May 22, 2024

ppannuto May 22, 2024

bradjc left a comment

bradjc commented May 7, 2024

lschuermann commented May 8, 2024 •

edited

ppannuto left a comment

lschuermann commented May 28, 2024

alevy left a comment

lschuermann commented May 31, 2024

	// Ask the clock about a current reference once. This can incur a
	// volatile read, and this may not be optimized if done in a loop:
	let now = self.alarm.now();

	let expired_handler = \|expired: Expiration<A::Ticks>, process_id: &ProcessId\| {
	// This closure is run on every expired alarm, _after_ the `enter()`
	// closure on the Grant iterator has returned. We are thus not
	// risking reentrancy here.

	// Enter the app's grant again:
	let _ = self.app_alarms.enter(*process_id, \|alarm_state, upcalls\| {
	// Reset this app's alarm:
	alarm_state.expiration = None;

	// Keep track of the number of armed alarms, to allow for the
	// early-exit at the top of this function:
	self.num_armed.set(self.num_armed.get() - 1);

	// Deliver the upcall:
	upcalls
	.schedule_upcall(
	ALARM_CALLBACK_NUM,
	(
	now.into_u32_left_justified() as usize,
	expired.reference.wrapping_add(expired.dt).into_usize(),
	0,
	),
	)
	.ok();
	});

	// Proceed iteration across expirations:
	None::<()>
	};

	// Compute the earliest alarm, and invoke the `expired_handler` for
	// every expired alarm. This will issue a callback and reset the alarms
	// respectively.
	let res = Self::earliest_alarm(
	now,
	// Pass an interator of all non-None expirations:
	self.app_alarms.iter().filter_map(\|app\| {
	let process_id = app.processid();
	app.enter(\|alarm_state, _upcalls\| {
	if let Some(exp) = alarm_state.expiration {
	Some((exp, process_id, expired_handler))
	} else {
	None
	}
	})
	}),
	);

capsules/alarm: left-justify 32 bit ticks, re-architect alarm, add unit tests #3975

Are you sure you want to change the base?

capsules/alarm: left-justify 32 bit ticks, re-architect alarm, add unit tests #3975

Conversation

lschuermann commented May 2, 2024 • edited

Pull Request Overview

Original Motivation

Alarm Driver Rewrite

Testing Strategy

TODO or Help Wanted

Documentation Updated

Formatting

lschuermann May 2, 2024

Choose a reason for hiding this comment

ppannuto May 7, 2024

Choose a reason for hiding this comment

lschuermann May 22, 2024

Choose a reason for hiding this comment

ppannuto May 22, 2024

Choose a reason for hiding this comment

bradjc left a comment

Choose a reason for hiding this comment

bradjc commented May 7, 2024

lschuermann commented May 8, 2024 • edited

ppannuto left a comment

Choose a reason for hiding this comment

lschuermann commented May 28, 2024

alevy left a comment

Choose a reason for hiding this comment

lschuermann commented May 31, 2024

lschuermann commented May 2, 2024 •

edited

lschuermann commented May 8, 2024 •

edited