-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Training Test Split - Post Merge Review #1362
base: Test_split_review
Are you sure you want to change the base?
Conversation
TEST = 2; // Client is requesting test data | ||
} | ||
|
||
message TrainingTestSplitResponse { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wasn't the happiest with this,
in retrospect and initalized probably isn't necessary.
iterator done means first iterator is done and it's a response we send so the client knows to finish iterating on the first iterator but keep the stream open; i'll add a comment
if random_state is None: | ||
random_state = 0 | ||
|
||
train, test = TrainingSetTestSplit( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
entry point to splitting
|
||
|
||
@dataclass | ||
class TrainingSetSplitDetails: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Core Python logic
|
||
return self | ||
|
||
def send_request(self, request_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add comment here and return value.
def __iter__(self): | ||
return self | ||
|
||
def __next__(self) -> Tuple[np.ndarray, np.ndarray]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
core iterator logic that interacts with the backend
if isTestFinished && isTrainFinished { | ||
// If both iterators are finished, we can close the stream | ||
serv.Logger.Infow("Both iterators are finished, closing stream") | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returning nil closes the stream
} | ||
} | ||
|
||
func (serv *FeatureServer) handleSplitInitializeRequest( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually might not end up needing this
@@ -82,13 +81,170 @@ func (serv *FeatureServer) TrainingData(req *pb.TrainingDataRequest, stream pb.F | |||
return nil | |||
} | |||
|
|||
func (serv *FeatureServer) TrainingTestSplit(stream pb.Feature_TrainingTestSplitServer) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
core logic I would like to be reviewed
@@ -287,6 +287,7 @@ type OfflineStore interface { | |||
CreateTrainingSet(TrainingSetDef) error | |||
UpdateTrainingSet(TrainingSetDef) error | |||
GetTrainingSet(id ResourceID) (TrainingSetIterator, error) | |||
GetTrainingSetTestSplit(id ResourceID, testSize float32, shuffle bool, randomState int) (TrainingSetIterator, TrainingSetIterator, func() error, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need eyes on this
@@ -954,6 +955,56 @@ func (store *clickHouseOfflineStore) CreateTrainingSet(def TrainingSetDef) error | |||
return nil | |||
} | |||
|
|||
func (store *clickHouseOfflineStore) CreateTrainingTestSplit( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is what I really really want a review on
Squashed commit of the following: commit 0c88d69 Author: Ali Olfat <ali@featureform.com> Date: Wed Feb 28 17:45:31 2024 -0800 remove unused function commit 2d39369 Author: Ali Olfat <ali@featureform.com> Date: Wed Feb 28 17:01:38 2024 -0800 even more clean up commit 0ec53b5 Author: Ali Olfat <ali@featureform.com> Date: Wed Feb 28 16:55:05 2024 -0800 small one commit 59caf8d Author: Ali Olfat <ali@featureform.com> Date: Wed Feb 28 16:52:02 2024 -0800 some more clean up commit bfa318c Author: Ali Olfat <ali@featureform.com> Date: Sun Feb 18 18:00:34 2024 -0800 move client to a separate file and refactor
logger.Errorw("Failed to get training set iterator", "Error", err) | ||
return err | ||
} | ||
defer dropViews() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not totally convinced I need this yet:
What i was thinking
dataset = get_training_set(ts.name, ts.var)
train, test = dataset.training_test_split(random_state=0)
if you don't drop the views after these are consumed, its extra stuff in their provider -- which really isn't a big deal -- and also running dataset.training_test_split(random_state=0) won't be a new random set
I could probably get rid of the view dropping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a few comments; overall great work!
One question, can the user convert train
and test
into pandas dataframe? if yes, can we include it in docs?
@@ -1,3 +1,4 @@ | |||
import featureform.resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: usually, there is an order for imports
raise ValueError("test_size must be between 0 and 1") | ||
if train_size > 1 or train_size < 0: | ||
raise ValueError("train_size must be between 0 and 1") | ||
if test_size != 0 and train_size != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can remove this conditional by moving the nested conditional to the both at the first level.
if test_size == 0 and train_size != 0:
test_size = 1 - train_size
if test_size != 0 and train_size == 0:
train_size = 1 - test_size
if test_size + train_size != 1:
raise ValueError("test_size + train_size must equal 1")
return test_size, train_size
break | ||
|
||
# Process and store the row data | ||
from featureform.serving import Row |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason this import is here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
circular imports
@@ -0,0 +1,237 @@ | |||
from featureform.serving import Dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: import ordering
|
||
|
||
def response(req_type, iterator_done): | ||
if req_type == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no idea, what the numbers mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it shows it right under
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but i fixed it
if err != nil { | ||
return | ||
} | ||
for set.Next() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there be some sort of check?
@@ -91,3 +96,106 @@ func createClickHouseDatabase(c pc.ClickHouseConfig) error { | |||
} | |||
return nil | |||
} | |||
|
|||
func TestTrainingSet(t *testing.T) { | |||
t.Skip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests are getting skipped?
return err | ||
} | ||
default: | ||
if err := serv.handleSplitDataRequest(stream, req, &trainIter, &testIter, &isTestFinished, &isTrainFinished, logger); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is kind of weird that all these variables are getting updated within this method especially for the isTestFinished and isTrainFinished
@@ -780,6 +781,11 @@ func (store *memoryOfflineStore) GetTrainingSet(id ResourceID) (TrainingSetItera | |||
} | |||
return data.(trainingRows).Iterator(), nil | |||
} | |||
|
|||
func (store *memoryOfflineStore) GetTrainingSetTestSplit(id ResourceID, testSize float32, shuffle bool, randomState int) (TrainingSetIterator, TrainingSetIterator, func() error, error) { | |||
return nil, nil, nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this have a not implemented too?
@@ -2264,6 +2264,10 @@ func (spark *SparkOfflineStore) GetTrainingSet(id ResourceID) (TrainingSetIterat | |||
return fileStoreGetTrainingSet(id, spark.Store, spark.Logger) | |||
} | |||
|
|||
func (spark *SparkOfflineStore) GetTrainingSetTestSplit(id ResourceID, testSize float32, shuffle bool, randomState int) (TrainingSetIterator, TrainingSetIterator, func() error, error) { | |||
return nil, nil, nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add a not implemented
or rather not supported for Spark
error message?
Description
Type of change
Does this correspond to an open issue?
Select type(s) of change
Checklist: