Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add the optimize function to nodejs and async python #1257

Merged
merged 6 commits into from
May 20, 2024

Conversation

westonpace
Copy link
Contributor

The optimize function is pretty crucial for getting good performance when building a large scale dataset but it was only exposed in rust (many sync python users are probably doing this via to_lance today)

This PR adds the optimize function to nodejs and to python.

I left the function marked experimental because I think there will likely be changes to optimization (e.g. if we add features like "optimize on write"). I also only exposed the cleanup_older_than configuration parameter since this one is very commonly used and the rest have sensible defaults and we don't really know why we would recommend different values for these defaults anyways.

* you have added or modified 100,000 or more records or run more than 20 data
* modification operations.
*/
async optimize(options?: Partial<OptimizeOptions>): Promise<OptimizeStats> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the OptimizeOptions are a bit confusing.

how does the cleanup logic work? Does it clean any tables older than now() - cleanupOlderThanMs? or is cleanupOlderThanMs supposed to be a unix timestamp in milliseconds?

which one of these is valid?

await table.optimize({cleanupOlderThanMs: 1577836800000}) // 2020-01-01T00:00:00.000Z
await table.optimize({cleanupOlderThanMs: 3600000; }) //1 hour ago

If it's the latter, I think we should accept a JS Date object instead of number.

A Date would make the intention much clearer

await table.optimize({cleanupOlderThan: new Date("2024-03-01") }) 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's the latter. It would be nice if JS had a duration / timedelta type class. A user's intent is not usually "I want to cleanup all versions before march 1st' It is usually "I want to cleanup all versions older than 7 days". For example, in python it looks like this:

tbl.cleanup_old_versions(timedelta(days=7))

In rust it looks like this:

tbl.cleanup_old_versions(Duration::days(7))

I agree though that a number is a little vague and a date is probably better. I will switch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A user's intent is not usually "I want to cleanup all versions before march 1st' It is usually "I want to cleanup all versions older than 7 days". For example, in python it looks like this:

Yeah I agree that JS does make this a little more awkward than languages with a duration datatype.
Maybe we could add an example using date math

const olderThan = new Date(); // set it to now

// date math in JS is very ugly. 
olderThan.setDate(olderThan.getDate() - 7)) // set it to 7 days ago

tbl.cleanupOldVersions(olderThan)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rebased and changed to Date with your example using Date math:

export interface OptimizeOptions {
  /**
   * If set then all versions older than the given date
   * be removed.  The current version will never be removed.
   *
   * The default is 7 days
   *
   * @example
   * // Delete all versions older than 1 day
   * const olderThan = new Date();
   * olderThan.setDate(olderThan.getDate() - 1));
   * tbl.cleanupOlderVersions(olderThan);
   *
   * // Delete all versions except the current version
   * tbl.cleanupOlderVersions(new Date());
   */
  cleanupOlderThan: Date;
}

@westonpace westonpace force-pushed the feat/optimize-in-nodejs-python branch from 6c45ae2 to d35c78b Compare May 2, 2024 13:22
@westonpace westonpace force-pushed the feat/optimize-in-nodejs-python branch from d35c78b to 1e43f84 Compare May 14, 2024 13:06
@github-actions github-actions bot added enhancement New feature or request Python Python SDK Rust Rust related issues labels May 14, 2024
@westonpace westonpace force-pushed the feat/optimize-in-nodejs-python branch from 0d2997c to 1eda99b Compare May 17, 2024 12:50
@westonpace westonpace force-pushed the feat/optimize-in-nodejs-python branch from 1eda99b to f898d06 Compare May 20, 2024 13:54
@westonpace westonpace merged commit 4f512af into lancedb:main May 20, 2024
22 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Python Python SDK Rust Rust related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants