
Added embed2 which returns a table structure #1186

Open

ns1000 wants to merge 4 commits into master

Conversation

@ns1000 (Contributor) commented Nov 25, 2023

This change adds a new wrapper for pgml.embed that returns a table structure instead of a single row, which is very useful for batch processing strings.

Example usage:

```sql
select * from
    pgml.embed2('all-MiniLM-L6-v2', (select array_agg(phrase) from (select * from phrases limit 10)));
```

The function is declared as follows:

```sql
CREATE OR REPLACE FUNCTION pgml."embed2"(
    transformer TEXT,
    inputs TEXT[],
    kwargs JSONB DEFAULT '{}'
) RETURNS TABLE (text TEXT, embedding real[])
    LANGUAGE c IMMUTABLE STRICT PARALLEL SAFE AS 'MODULE_PATHNAME', 'embed_batch2_wrapper';
```

…ould segfault after a client session which used pgml command closes. The issue can be identified in postgres log files with the line 'arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit'

@montanalow (Contributor) left a comment


You can add a migration to create the function for people who upgrade, e.g.
https://github.com/postgresml/postgresml/blob/master/pgml-extension/sql/pgml--2.7.13--2.8.0.sql
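Following the linked file's pattern, such a migration might look like the sketch below. The file name and version numbers are hypothetical; the actual target version depends on the release this lands in:

```sql
-- Hypothetical migration file, e.g. sql/pgml--2.8.0--2.8.1.sql
-- (name and versions are illustrative only); it re-declares the new
-- function so existing installations pick it up on ALTER EXTENSION ... UPDATE.
CREATE OR REPLACE FUNCTION pgml."embed2"(
    transformer TEXT,
    inputs TEXT[],
    kwargs JSONB DEFAULT '{}'
) RETURNS TABLE (text TEXT, embedding real[])
    LANGUAGE c IMMUTABLE STRICT PARALLEL SAFE AS 'MODULE_PATHNAME', 'embed_batch2_wrapper';
```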

```diff
@@ -558,6 +558,26 @@ pub fn embed_batch(
     }
 }
 
+#[cfg(all(feature = "python", not(feature = "use_as_lib")))]
+#[pg_extern(immutable, parallel_safe, name = "embed2")]
+pub fn embed_batch2<'a>(
```

My rough thoughts, without running the code on some examples.

I think we should name this embed_3 in SQL and embed_batch_3 in Rust, with the goal of establishing this as the 3.0 embed API, as well as a pattern for releasing 3.0 APIs early while we develop them in an alpha state (with potentially breaking changes, where we completely drop them in 3.1 in favor of the newly established default behavior).

Your example convinces me that batch APIs should return a table, but I think that table's rows should be JSONB with {id, embedding} keys (at least), unless there is a significant performance implication on that front. My thinking is that embedding models are getting more complicated and now some take JSON rather than TEXT for inputs including a prompt. It would be nice to have an optional id in the input JSON, and if it's not present, then just return the entire input JSON as the id, which acts just like your TEXT as the key.

Final thought is that kwargs is JSONB currently, which works well with the underlying Python dependencies, but I'd like to structure it as much as possible for final 3.0. We should find a way to flag this obviously as an alpha API, that will be broken and eventually dropped when a final version is available.
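To restate the idea above as a sketch, the alpha signature might look something like this. The function and wrapper names, and the exact row shape, are hypothetical, not a committed design:

```sql
-- Hypothetical alpha API sketch: inputs may be JSON documents
-- (e.g. containing a prompt and an optional id), and each output row
-- is a JSONB document with at least {id, embedding} keys. If the input
-- has no id, the entire input JSON would be echoed back as the id.
CREATE OR REPLACE FUNCTION pgml."embed_3"(
    transformer TEXT,
    inputs JSONB[],
    kwargs JSONB DEFAULT '{}'
) RETURNS TABLE (result JSONB)
    LANGUAGE c IMMUTABLE STRICT PARALLEL SAFE AS 'MODULE_PATHNAME', 'embed_batch_3_wrapper';
```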

@ns1000 (Contributor, Author) commented Nov 25, 2023

So it turns out the batching is not really necessary to achieve speed. When running on CPU inside the Postgres Python VM, you really need torch.set_num_threads(1) to get maximum speed. Leaving it at the default value, which is the number of CPUs, was causing the slowdown for me. It will still use all the CPUs even with threads=1.

I am using a Debian system with Python 3.11 and Postgres 16 to test all this.
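As a minimal sketch of the tuning described above (assuming PyTorch is installed; the surrounding context of an embedding call inside the Postgres Python interpreter is implied, not shown):

```python
import torch

# Cap PyTorch's intra-op thread pool at a single thread. Inside the
# Postgres Python VM the default (one thread per CPU) caused contention
# and slowdowns; with threads=1 each backend call is fast, and all CPUs
# are still utilized across concurrent backend processes.
torch.set_num_threads(1)

# Sanity check: the setting took effect.
print(torch.get_num_threads())
```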

@montanalow (Contributor) commented Nov 25, 2023

> So it turns out the batching is not really necessary to achieve speed. When running on CPU within the Postgres python VM, you really need to torch.set_num_threads(1) in order to get the maximum speed. Leaving it to the default value, which is the number of CPUs was creating the slow down problems for me. It will still use all the CPUs when use threads=1.
>
> I am using a debian system, with python 3.11 and postgres 16 to test all this.

Ah, so this is actually another hit on #1161
