[ENH]: Trying to get outputs from the best results for different models in different columns but getting same results for each #3991

Open

sm-ak-r33 opened this issue May 20, 2024 · 0 comments
Labels: enhancement (New feature or request)

sm-ak-r33 commented May 20, 2024
Describe the feature you want to add to this project

I am trying to compare ML models (linear regression and some tree-based models) across 240 time series with the code below, but the outputs are the same for every model.

```python
import pandas as pd
from pycaret.regression import *
from sklearn.model_selection import TimeSeriesSplit
import numpy as np


class TimeSeriesSplitCustom(TimeSeriesSplit):
    def __init__(self, n_splits=5, max_train_size=None, test_size=1, min_train_size=1):
        super().__init__(n_splits=n_splits, max_train_size=max_train_size)
        self.test_size = test_size
        self.min_train_size = min_train_size

    def split(self, X, y=None, groups=None):
        min_train_size = self.min_train_size
        test_size = self.test_size
        n_splits = self.n_splits
        n_samples = len(X)

        if (n_samples - min_train_size) / test_size >= n_splits:
            yield from super().split(X)
        else:
            shift = int(np.floor((n_samples - test_size - min_train_size) / (n_splits - 1)))
            start_test = n_samples - (n_splits * shift + test_size - shift)
            test_starts = range(start_test, n_samples - test_size + 1, shift)

            if start_test < min_train_size:
                raise ValueError(
                    f"The start of the testing ({start_test}) is smaller than "
                    f"the minimum training samples ({min_train_size})."
                )

            indices = np.arange(n_samples)
            for test_start in test_starts:
                yield (indices[:test_start], indices[test_start:test_start + test_size])
```
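
To show what the splitter does, here is a quick sanity check on a toy series (the parameter values are arbitrary and just for illustration):

```python
# Quick sanity check of TimeSeriesSplitCustom on a toy series.
# With 10 samples, n_splits=4, test_size=2 and min_train_size=3, the
# fallback branch runs and each fold's test window shifts by one step.
X_toy = np.arange(10).reshape(-1, 1)
cv = TimeSeriesSplitCustom(n_splits=4, test_size=2, min_train_size=3)
for train_idx, test_idx in cv.split(X_toy):
    print(train_idx, test_idx)
```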

```python
class ModelTrainer:
    def __init__(self, dataframe, unique_id_column, target_column, horizon):
        self.df = dataframe
        self.unique_id_column = unique_id_column
        self.target_column = target_column
        self.horizon = horizon
        self.models = {}
        self.model_ids = ['lr', 'rf', 'xgboost', 'lightgbm']  # Add more models as needed

    def fit_models(self):
        unique_ids = self.df[self.unique_id_column].unique()
        for model_id in self.model_ids:
            self.models[model_id] = {}

        for model_id in self.model_ids:
            for uid in unique_ids:
                df_filtered = self.df[self.df[self.unique_id_column] == uid]
                df_filtered = df_filtered.drop(columns=[self.unique_id_column])  # Assuming no additional columns are needed for modeling

                # Set up the data before training each model
                setup(data=df_filtered, target=self.target_column, verbose=False, session_id=123)

                # Split data using the custom time series split
                custom_cv = self.custom_cv_generator(df_filtered)

                model = create_model(model_id)
                tuned_model = tune_model(model, fold=custom_cv)  # Optional: hyperparameter tuning
                self.models[model_id][uid] = tuned_model

    def custom_cv_generator(self, df):
        return TimeSeriesSplitCustom(n_splits=5, max_train_size=len(df) - self.horizon, test_size=self.horizon)

    def predict(self, dataframe):
        results = []
        for uid in self.models[self.model_ids[0]].keys():
            df_filtered = dataframe[dataframe[self.unique_id_column] == uid]
            df_filtered = df_filtered.drop(columns=[self.unique_id_column])  # Exclude ID for prediction
            last_rows = df_filtered.tail(self.horizon)  # Get the last 'horizon' rows for each UID

            forecasts = {'unique_id': [uid] * self.horizon}
            for model_id in self.model_ids:
                model = self.models[model_id][uid]
                forecast = []
                for i in range(self.horizon):
                    prediction = predict_model(model, data=last_rows.iloc[[i]].reset_index(drop=True))
                    forecast.append(prediction[self.target_column].values[0])
                forecasts[model_id] = forecast

            uid_results = pd.DataFrame(forecasts)
            results.append(uid_results)

        return pd.concat(results, ignore_index=True)
```
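
One thing I have not been able to rule out (this is an assumption on my part, not something I have verified): in pycaret 3.x, `predict_model` returns its predictions in a `prediction_label` column, so indexing the original target column would simply echo the input values and make every model look identical. A quick diagnostic sketch, where `trainer` and `uid` are placeholders for a fitted `ModelTrainer` and one of its series IDs:

```python
# Diagnostic sketch, assuming pycaret 3.x column naming; 'trainer' and 'uid'
# are placeholders for a fitted ModelTrainer and one of its series IDs.
sample = trainer.df[trainer.df[trainer.unique_id_column] == uid] \
    .drop(columns=[trainer.unique_id_column]).tail(1)
preds = predict_model(trainer.models['lr'][uid], data=sample)
print(preds.columns)  # Look for a 'prediction_label' column here.
# If the prediction lives in 'prediction_label', the loop above should read
# prediction['prediction_label'].values[0] instead of the target column.
```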

Describe your proposed solution

Is there some option that optimizes or caches results across the whole run and needs to be set to False?

Describe alternatives you've considered, if relevant

I did not find any option to turn it off.

Additional context

No response
