-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
All Train Test Split Changes #1381
base: without-ts-split
Are you sure you want to change the base?
Conversation
provider/clickhouse.go
Outdated
colTypes, err := store.getValueColumnTypes(trainingTestSplitName) | ||
fmt.Printf("these are the column types: %v\n", colTypes) | ||
if err != nil { | ||
return nil, nil, nil, fmt.Errorf("could not get column types: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this happened, dropFunc would never be called right?
Maybe force the caller to call create and defer close then get the iterators. Would also make the function signature a little cleaner.
provider/clickhouse.go
Outdated
|
||
// return callback to drop view | ||
dropFunc := func() error { | ||
// two queries to drop split and row number table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of date comment?
provider/clickhouse.go
Outdated
testSize float32, | ||
shuffle bool, | ||
randomState int, | ||
) (TrainingSetIterator, TrainingSetIterator, func() error, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dont love having this many things returned, I have advice in a comment below though
api/main.go
Outdated
for { | ||
req, err := stream.Recv() | ||
if err == io.EOF { | ||
// Client has closed the stream, close the downstream stream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't need the comment. self explanatory via code + logger message.
api/main.go
Outdated
return err | ||
} | ||
|
||
// Forward the request to the downstream service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same as above.
api/main.go
Outdated
|
||
resp, err := clientStream.Recv() | ||
if err == io.EOF { | ||
// End of stream from downstream service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same
@@ -10,6 +10,7 @@ package featureform.serving.proto; | |||
|
|||
service Feature { | |||
rpc TrainingData(TrainingDataRequest) returns (stream TrainingDataRow) {} | |||
rpc TrainTestSplit(stream TrainTestSplitRequest) returns (stream BatchTrainTestSplitResponse) {} | |||
rpc TrainingDataColumns(TrainingDataColumnsRequest) returns (TrainingColumns) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: try to keep formatting changes to separate PRs otherwise it forces the reader to diff formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hit the gear icon in github -> "hide whitespaace"
proto/serving.proto
Outdated
float train_size = 4; | ||
bool shuffle = 5; | ||
int32 random_state = 6; // Seed for shuffling, if shuffle is true | ||
RequestType request_type = 7; // Specify the type of data being requested |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just name it RequestDataType request_data_type
to avoid needing the clarifying comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i can remove these comments, they were generated
client/src/featureform/serving.py
Outdated
(This functionality is currently only available for Clickhouse). | ||
|
||
Splits an existing training set into training and testing iterators. The split is processed on the underlying | ||
provider and calculated and serving time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace "calculated and serving time." with "calculated at serving time."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
train (Iterator): An iterator for training values. | ||
test (Iterator): An iterator for testing values. | ||
""" | ||
if batch_size < 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will the min batch size be configurable at some point? if so I'd replace this magic number with a const + interpolate the value error msg with the const.
if not ignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min batch size has to be 1 right? effectively meaning no batch
client/src/featureform/serving.py
Outdated
variant = self._stream.version | ||
stub = self._stream._stub | ||
model = self._stream.model if hasattr(self._stream, "model") else None | ||
if random_state is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this random_state
conditional be a part of the first if check on line 533?
also it sets the random_state
to zero, but that's an error state value in the first conditional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also really good catch
client/src/featureform/serving.py
Outdated
|
||
@staticmethod | ||
def validate_test_size(test_size, train_size): | ||
if test_size > 1 or test_size < 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: anything we can do to clean this function up? it's a rough read.
for example we could reorder the first IF to be
if test_size < 0 or 1 < test_size
to adhere to natural left-right reading pattern (unless you're optimizing for likeliest condition?)
return type_mapping[value.WhichOneof("value")] | ||
|
||
|
||
def get_numpy_array_type(types): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd throw a parameterized unit test at this function. reads like it could easily be a source of a hidden logic bug.
if self._np_type_parser is None and data.rows: | ||
self._np_type_parser = _NpProtoTypeParser.init_types(data.rows[0]) | ||
|
||
for i, row in enumerate(data.rows): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not using the index var "i". can just drop it like:
for row in enumerate(data.rows)
provider/clickhouse.go
Outdated
randomState int, | ||
) (string, error) { | ||
// Generate unique suffix for the view names | ||
tableNameSuffix := fmt.Sprintf("%s_%d_%t_%d", trainingSetTable, int(testSize*100), shuffle, randomState) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int(testSize*100)
why 100, is it not better to use rand int here?
|
||
req, err := stream.Recv() | ||
if err != nil { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int: log that the error happened at recv()
serving/serving.go
Outdated
} | ||
serv.Logger.Debugw("Get Training Set From Store", "name", name, "variant", variant) | ||
return store.GetTrainingSet(provider.ResourceID{Name: name, Variant: variant}) | ||
} | ||
|
||
func (serv *FeatureServer) getTrainingSetTestSplitIterator(name, variant string, testSize float32, shuffle bool, randomState int) (provider.TrainingSetIterator, provider.TrainingSetIterator, func() error, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: too many returns here. maybe use a wrapper SplitResp
struct or similar.
move client to a separate file and refactor some more clean up small one even more clean up remove unused function small comments address ahmad's cpomment ts split stuff Added some optimizations and comments some more changes get rid of logs rename protos fix tests rename some things cleanup some missed name change Delete grpc_debug.log one more name change small fixes stupid alllll the changes i hate changing interfaces one more some more more misses whoops
@@ -214,6 +214,15 @@ func (it *bqGenericTableIterator) Values() GenericRecord { | |||
|
|||
func (it *bqGenericTableIterator) Columns() []string { | |||
var columns []string | |||
// As the documentation for bigquery.Schema notes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignore this
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## without-ts-split #1381 +/- ##
===================================================
Coverage ? 55.71%
===================================================
Files ? 190
Lines ? 24780
Branches ? 836
===================================================
Hits ? 13806
Misses ? 9441
Partials ? 1533 ☔ View full report in Codecov by Sentry. |
@@ -82,13 +79,182 @@ func (serv *FeatureServer) TrainingData(req *pb.TrainingDataRequest, stream pb.F | |||
return nil | |||
} | |||
|
|||
type splitContext struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: not a fan of the struct name since context
is a strong convention in go. maybe SplitMetadata
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly agree and thought about it a bit too but with it being an internal struct and literally being the context for all the functions I thought it would do.
|
||
for { | ||
if isTrainFinished && isTestFinished { | ||
// If both iterators are finished, we can close the stream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete the comment imo. the code is self-documenting.
featureObserver := serv.Metrics.BeginObservingTrainingServe(name, variant) | ||
defer featureObserver.Finish() | ||
|
||
ctx := splitContext{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ctx clashes with a strong go convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeahhh i thought about that, I'm a big advocate of just typing things out so I'll do that
} | ||
} | ||
|
||
response := &pb.BatchTrainTestSplitResponse{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: isn't this logically a request? I feel like the naming is a bit off. like you send requests, and receive responses but currently the proto looks like this:
Send(*BatchTrainTestSplitResponse) error
Recv() (*TrainTestSplitRequest, error)
but resolve if I'm missing some context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeahhh it's kindaaaa weird
so the flow is this ->
we make a trainsplitreq -> recieving by api/main -> send to serving.go -> recieve req by serving -> create response by serving and SEND response -> recv resp on main -> send resp on main
*ctx.isTrainFinished = true | ||
} | ||
|
||
if *ctx.isTestFinished && *ctx.isTrainFinished { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: can a request be just a testType or a trainType, or is it always both types running together?
the reason I ask is because it looks like we need both conditions to close out, but lines 230 and 232 set them independently of each other, so one could still be false?
Ignore If I'm misreading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good q.
a request is either a testtype OR a traintype, never both, if one of them finishes we return an iterator finished response so that the iteration is stopped but the stream isn't closed
} | ||
store, err := p.AsOfflineStore() | ||
if err != nil { | ||
// This means that the provider of the training set isn't an offline store. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like an important distinction. tbh I'd take the comment and write it to the logger.
Description
All Training Set Split changes thus far
Type of change
Does this correspond to an open issue?
Select type(s) of change
Checklist: