-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add the optimize function to nodejs and async python #1257
feat: add the optimize function to nodejs and async python #1257
Conversation
* you have added or modified 100,000 or more records or run more than 20 data | ||
* modification operations. | ||
*/ | ||
async optimize(options?: Partial<OptimizeOptions>): Promise<OptimizeStats> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the OptimizeOptions
are a bit confusing.
how does the cleanup logic work? Does it clean any tables older than now() - cleanupOlderThanMs
? or is cleanupOlderThanMs
supposed to be a unix timestamp in milliseconds?
which one of these is valid?
await table.optimize({cleanupOlderThanMs: 1577836800000}) // 2020-01-01T00:00:00.000Z
await table.optimize({cleanupOlderThanMs: 3600000; }) //1 hour ago
If it's the latter, I think we should accept a JS Date
object instead of number
.
A Date
would make the intention much clearer
await table.optimize({cleanupOlderThan: new Date("2024-03-01") })
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's the latter. It would be nice if JS had a duration / timedelta type class. A user's intent is not usually "I want to cleanup all versions before march 1st' It is usually "I want to cleanup all versions older than 7 days". For example, in python it looks like this:
tbl.cleanup_old_versions(timedelta(days=7))
In rust it looks like this:
tbl.cleanup_old_versions(Duration::days(7))
I agree though that a number
is a little vague and a date is probably better. I will switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A user's intent is not usually "I want to cleanup all versions before march 1st' It is usually "I want to cleanup all versions older than 7 days". For example, in python it looks like this:
Yeah I agree that JS does make this a little more awkward than languages with a duration datatype.
Maybe we could add an example using date math
const olderThan = new Date(); // set it to now
// date math in JS is very ugly.
olderThan.setDate(olderThan.getDate() - 7)) // set it to 7 days ago
tbl.cleanupOldVersions(olderThan)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rebased and changed to Date
with your example using Date
math:
export interface OptimizeOptions {
/**
* If set then all versions older than the given date
* be removed. The current version will never be removed.
*
* The default is 7 days
*
* @example
* // Delete all versions older than 1 day
* const olderThan = new Date();
* olderThan.setDate(olderThan.getDate() - 1));
* tbl.cleanupOlderVersions(olderThan);
*
* // Delete all versions except the current version
* tbl.cleanupOlderVersions(new Date());
*/
cleanupOlderThan: Date;
}
6c45ae2
to
d35c78b
Compare
d35c78b
to
1e43f84
Compare
0d2997c
to
1eda99b
Compare
1eda99b
to
f898d06
Compare
The optimize function is pretty crucial for getting good performance when building a large scale dataset but it was only exposed in rust (many sync python users are probably doing this via to_lance today)
This PR adds the optimize function to nodejs and to python.
I left the function marked experimental because I think there will likely be changes to optimization (e.g. if we add features like "optimize on write"). I also only exposed the
cleanup_older_than
configuration parameter since this one is very commonly used and the rest have sensible defaults and we don't really know why we would recommend different values for these defaults anyways.