
Optimal coefficients, fixes #8847 (#8885)

Closed

Conversation


@liakdimi1 commented Jul 22, 2023

Describe your change:

Fixes #8847
I found the optimal number of iterations and learning rate for achieving the closest coefficients.
Screenshot (266)

  • Add an algorithm?
  • Fix a bug or typo in an existing algorithm?
  • Documentation change?

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation.
  • If this pull request resolves one or more open issues then the description above includes the issue number(s) with a closing keyword: "Fixes #ISSUE-NUMBER".

@algorithms-keeper bot added the enhancement (This PR modified some existing files) and awaiting reviews (This PR is ready to be reviewed) labels Jul 22, 2023
Comment on lines +72 to +73
iterations = 250000
alpha = 0.000326
@tianyizheng02 (Contributor)

This won't give the optimal coefficients. Like I said in my original issue, gradient descent is an approximation algorithm by definition, so tweaking the parameters will only give you a better approximation. My point with the original issue is that we should add a different algorithm, one that is a direct method rather than an iterative method.
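
For context, the direct method for this problem would be ordinary least squares, solved via the normal equations or NumPy's built-in least-squares solver. A minimal sketch (the function name here is hypothetical, not code from this PR):

import numpy as np

def ols_coefficients(data_x: np.ndarray, data_y: np.ndarray) -> np.ndarray:
    # Direct method: solve min ||data_x @ theta - data_y||^2 in one step,
    # exact up to floating-point error, with no learning rate or iteration
    # count to tune.
    theta, *_ = np.linalg.lstsq(data_x, data_y, rcond=None)
    return theta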

@tianyizheng02 (Contributor)

This also doesn't fix the fact that the code is still calculating SSE incorrectly.

@liakdimi1 (Author)

Can you elaborate on what you mean by a direct method?

@tianyizheng02 (Contributor)

A direct method gives an exact solution. By contrast, an iterative method (such as gradient descent) gives increasingly accurate approximations. With an iterative method, you get closer and closer to the exact solution but you may never actually reach that exact solution.

For example, let's say you want to find the roots of a quadratic polynomial $ax^2 + bx + c$. You could use Newton's method to approximate the roots, refining your solutions to be more and more accurate; this is an iterative method. Alternatively, you could use the quadratic formula to get the exact solutions directly; this is a direct method.
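
To make the contrast concrete, here is a small illustrative sketch (a toy example, not from the repo), finding the roots of $x^2 - 2$ both ways:

import math

def newtons_method(f, df, x0, steps=10):
    # Iterative method: each step refines an approximation of a root of f,
    # but the result remains an approximation.
    x = x0
    for _ in range(steps):
        x -= f(x) / df(x)
    return x

def quadratic_formula(a, b, c):
    # Direct method: compute the exact roots of ax^2 + bx + c in one shot
    # (assuming a nonnegative discriminant for this sketch).
    disc = math.sqrt(b * b - 4 * a * c)
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)

print(newtons_method(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))  # ~1.41421356
print(quadratic_formula(1, 0, -2))  # (1.4142135623730951, -1.4142135623730951)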

@liakdimi1 (Author)

Also, explain with facts why the SSE is calculated incorrectly, because the code makes the predictions, finds the deviations, takes the squares of the deviations, and sums the errors, which is the definition of SSE.

@tianyizheng02 (Contributor)

I already explained this with my code sample in the original issue, but sure, let's go over the math for the SSE function:

import numpy as np

def sum_of_square_error(data_x, data_y, len_data, theta):
    """Return sum of square error for error calculation
    :param data_x   : contains our dataset
    :param data_y   : contains the output (result vector)
    :param len_data : len of the dataset
    :param theta    : contains the feature vector
    :return         : sum of square error computed from given features
    """
    prod = np.dot(theta, data_x.transpose())  # predicted responses
    prod -= data_y.transpose()                # deviations from actual responses
    sum_elem = np.sum(np.square(prod))        # sum of squared deviations (SSE)
    error = sum_elem / (2 * len_data)         # NOTE: divides by 2 * len_data
    return error

Lines 60–62 are fine because they calculate the SSE exactly as we expect: they compute the differences between the predicted and actual response values, square them, and then sum them:
$$\mathrm{SSE} = \sum_{i = 1}^{N} (\mathbf{x}_i^{\top} \boldsymbol{\theta} - \mathbf{y}_i)^2$$

However, for some reason, the function then divides the SSE by 2 * len_data. Dividing the SSE (sum of squared errors) by the number of datapoints gives you the MSE (mean squared error):
$$\mathrm{MSE} = \frac{1}{N} \sum_{i = 1}^{N} (\mathbf{x}_i^{\top} \boldsymbol{\theta} - \mathbf{y}_i)^2$$

Naturally, it follows that the "SSE" function is actually calculating half of the MSE because it's dividing the SSE by $2N$.
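
To make the difference concrete, a small sketch with hypothetical toy data (not from the repo):

import numpy as np

data_x = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # toy dataset
data_y = np.array([2.0, 3.0, 5.0])
theta = np.array([0.5, 1.0])
n = len(data_y)

residuals = data_x @ theta - data_y
sse = np.sum(residuals ** 2)   # sum of squared errors
mse = sse / n                  # mean squared error
half_mse = sse / (2 * n)       # what sum_of_square_error actually returns

print(sse, mse, half_mse)      # 0.75, 0.25, 0.125: three different values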

@liakdimi1 (Author)

I appreciate your response.

@tianyizheng02 (Contributor)

If this pull request resolves one or more open issues then the description above includes the issue number(s) with a closing keyword: "Fixes #ISSUE-NUMBER".

Please add the issue number to the description of your PR. You'll know it worked when you see the issue appear in this section on the right-hand side of this PR:

screenshot

@liakdimi1 (Author)

As far as the calculation of the SSE goes, this is what I found:

When it comes to the calculation of the Sum of Squared Errors (SSE), the division by 2 * len_data is a significant factor. This division is related to the derivative of the cost function during optimization, particularly when using gradient descent. By dividing the SSE by 2 * len_data, the derivative of the cost function (SSE) with respect to the model parameters (theta) becomes simpler and results in a cleaner expression.

Additionally, it is worth noting that the cost function commonly used in linear regression is the Mean Squared Error (MSE), which is calculated as SSE divided by 2 * len_data:

MSE = SSE / (2 * len_data)

When taking the derivative of MSE with respect to the model parameters (theta), you get:

d(MSE)/d(theta) = (1/len_data) * (theta.dot(data_x.transpose()) - data_y.transpose()).dot(data_x)

Here, the division by 2 in the original SSE equation cancels out during the differentiation process, and the term 1/len_data helps scale the gradients appropriately to handle datasets of different sizes.
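
That cancellation can be checked numerically. A sketch with hypothetical toy data, comparing the analytic gradient of the 1/(2n)-scaled loss against a finite-difference estimate:

import numpy as np

data_x = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # toy dataset
data_y = np.array([2.0, 3.0, 5.0])
theta = np.array([0.5, 1.0])
n = len(data_y)

def half_mse(t):
    r = data_x @ t - data_y
    return np.sum(r ** 2) / (2 * n)

# Analytic gradient: the 1/2 cancels the 2 from the exponent,
# leaving (1/n) * X^T (X theta - y)
grad = data_x.T @ (data_x @ theta - data_y) / n

# Central finite-difference check
eps = 1e-6
fd = np.array([(half_mse(theta + eps * e) - half_mse(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.allclose(grad, fd))  # True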

Although the division by 2 * len_data doesn't directly impact the optimization process, it offers a more interpretable representation of the derivative of the cost function, which can be advantageous in certain contexts.

Whether to include the division by 2 * len_data in the cost function is a matter of preference. Some implementations choose to omit this division and adjust the learning rate accordingly to compensate for the scale change. However, to ensure consistency throughout the optimization process and related calculations, it's essential to make a deliberate choice and adhere to it.

So I don't believe your speculation is correct, and the implementation of the SSE stands. That's all from me.

@tianyizheng02 (Contributor)

This division is related to the derivative of the cost function during optimization, particularly when using gradient descent. By dividing the SSE by 2 * len_data, the derivative of the cost function (SSE) with respect to the model parameters (theta) becomes simpler and results in a cleaner expression.

It's perfectly fine to scale the loss function to make the math cleaner, but again the problem here is that this scaled version is no longer the SSE. SSE has a precise mathematical definition, so why should we call the scaled version the SSE when it's clearly not? Why not just call it scaled_sum_of_squared_errors or half_mean_squared_error?

Mean Squared Error (MSE), which is calculated as SSE divided by 2 * len_data

This is just factually wrong. MSE is defined to be SSE divided by len_data, with no division by 2. I pointed this out already in my previous comment, and you can easily verify this fact with Wikipedia or your search engine of choice:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i = 1}^{n} (Y_i - \hat{Y}_i)^2$$

While I have seen half of the MSE used in the context of ML (because, as you pointed out, it can make the math cleaner), we should definitely avoid misleading people with a different definition because this repo is for educational purposes.

Anyway, you missed the whole point of my issue. My whole point was that the output of the function sum_of_square_error is not the SSE. This is still absolutely true, and I've already demonstrated it to you using both code and math. Plus, the fact that you've been referring to the implementation's loss function as the MSE shows that you also acknowledge that it's not the SSE. If you prefer MSE (or half of the MSE) because it's more interpretable or more computationally elegant, then that's perfectly fine—I personally couldn't care less as to whether this implementation uses SSE or MSE. However, in order to not be incorrect or misleading, the function name should still be changed to reflect what it actually calculates.

@tianyizheng02 (Contributor)

Closing this PR since, as I've stated already, it doesn't actually resolve the linked issue.


Linked issue: machine_learning/linear_regression.py doesn't give optimal coefficients