
Optimal coefficients, fixes #8847 (#8885)

Closed

Conversation


@liakdimi1 commented Jul 22, 2023

Describe your change:

Fixes #8847
I found the optimal number of iterations and learning rate for achieving the closest coefficients.
Screenshot (266)

  • Add an algorithm?
  • Fix a bug or typo in an existing algorithm?
  • Documentation change?

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation.
  • If this pull request resolves one or more open issues then the description above includes the issue number(s) with a closing keyword: "Fixes #ISSUE-NUMBER".

@algorithms-keeper bot added the enhancement (This PR modified some existing files) and awaiting reviews (This PR is ready to be reviewed) labels Jul 22, 2023
Comment on lines +72 to +73
iterations = 250000
alpha = 0.000326
@tianyizheng02 (Contributor)

This won't give the optimal coefficients. Like I said in my original issue, gradient descent is an approximation algorithm by definition, so tweaking the parameters will only give you a better approximation. My point with the original issue is that we should add a different algorithm, one that is a direct method rather than an iterative method.
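
For context, the direct method for this problem would be ordinary least squares, solved via the normal equations or NumPy's built-in least-squares solver. A minimal sketch (the function name here is hypothetical, not code from this PR):

import numpy as np

def ols_coefficients(data_x: np.ndarray, data_y: np.ndarray) -> np.ndarray:
    # Direct method: solve min ||data_x @ theta - data_y||^2 in one step,
    # exact up to floating-point error, with no learning rate or iteration
    # count to tune.
    theta, *_ = np.linalg.lstsq(data_x, data_y, rcond=None)
    return theta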

@tianyizheng02 (Contributor)

This also doesn't fix the fact that the code is still calculating SSE incorrectly.

@liakdimi1 (Author)

Can you elaborate on what you mean by a direct method?

@tianyizheng02 (Contributor)

A direct method gives an exact solution. By contrast, an iterative method (such as gradient descent) gives increasingly accurate approximations. With an iterative method, you get closer and closer to the exact solution but you may never actually reach that exact solution.

For example, let's say you want to find the roots of a quadratic polynomial $ax^2 + bx + c$. You could use Newton's method to approximate the roots, refining your solutions to be more and more accurate; this is an iterative method. Alternatively, you could use the quadratic formula to get the exact solutions directly; this is a direct method.
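
To make the contrast concrete, here is a small illustrative sketch (a toy example, not from the repo), finding the roots of $x^2 - 2$ both ways:

import math

def newtons_method(f, df, x0, steps=10):
    # Iterative method: each step refines an approximation of a root of f,
    # but the result remains an approximation.
    x = x0
    for _ in range(steps):
        x -= f(x) / df(x)
    return x

def quadratic_formula(a, b, c):
    # Direct method: compute the exact roots of ax^2 + bx + c in one shot
    # (assuming a nonnegative discriminant for this sketch).
    disc = math.sqrt(b * b - 4 * a * c)
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)

print(newtons_method(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))  # ~1.41421356
print(quadratic_formula(1, 0, -2))  # (1.4142135623730951, -1.4142135623730951)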

@liakdimi1 (Author)

Also, explain with facts why the SSE is calculated incorrectly, because the code makes the predictions, finds the deviations, takes the squares of the deviations, and sums the errors, which is the definition of SSE.

@tianyizheng02 (Contributor)

I already explained this with my code sample in the original issue, but sure, let's go over the math for the SSE function:

import numpy as np

def sum_of_square_error(data_x, data_y, len_data, theta):
    """Return sum of square error for error calculation
    :param data_x   : contains our dataset
    :param data_y   : contains the output (result vector)
    :param len_data : len of the dataset
    :param theta    : contains the feature vector
    :return         : sum of square error computed from given features
    """
    prod = np.dot(theta, data_x.transpose())  # predicted responses
    prod -= data_y.transpose()                # deviations from actual responses
    sum_elem = np.sum(np.square(prod))        # sum of squared deviations (SSE)
    error = sum_elem / (2 * len_data)         # NOTE: divides by 2 * len_data
    return error

Lines 60–62 are fine because they calculate the SSE exactly as we expect: they compute the differences between the predicted and actual response values, square them, and then sum them:
$$\mathrm{SSE} = \sum_{i = 1}^{N} (\mathbf{x}_i^{\top} \boldsymbol{\theta} - \mathbf{y}_i)^2$$

However, for some reason, the function then divides the SSE by 2 * len_data. Dividing the SSE (sum of squared errors) by the number of datapoints gives you the MSE (mean squared error):
$$\mathrm{MSE} = \frac{1}{N} \sum_{i = 1}^{N} (\mathbf{x}_i^{\top} \boldsymbol{\theta} - \mathbf{y}_i)^2$$

Naturally, it follows that the "SSE" function is actually calculating half of the MSE because it's dividing the SSE by $2N$.
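
To make the difference concrete, a small sketch with hypothetical toy data (not from the repo):

import numpy as np

data_x = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # toy dataset
data_y = np.array([2.0, 3.0, 5.0])
theta = np.array([0.5, 1.0])
n = len(data_y)

residuals = data_x @ theta - data_y
sse = np.sum(residuals ** 2)   # sum of squared errors
mse = sse / n                  # mean squared error
half_mse = sse / (2 * n)       # what sum_of_square_error actually returns

print(sse, mse, half_mse)      # 0.75, 0.25, 0.125: three different values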

@liakdimi1 (Author)

I appreciate your response.

@tianyizheng02 (Contributor)

If this pull request resolves one or more open issues then the description above includes the issue number(s) with a closing keyword: "Fixes #ISSUE-NUMBER".

Please add the issue number to the description of your PR. You'll know it worked when you see the issue appear in this section on the right-hand side of this PR:

screenshot

@liakdimi1 (Author)

As far as the calculation of the SSE goes, this is what I found:

When it comes to the calculation of the Sum of Squared Errors (SSE), the division by 2 * len_data is a significant factor. This division is related to the derivative of the cost function during optimization, particularly when using gradient descent. By dividing the SSE by 2 * len_data, the derivative of the cost function (SSE) with respect to the model parameters (theta) becomes simpler and results in a cleaner expression.

Additionally, it is worth noting that the cost function commonly used in linear regression is the Mean Squared Error (MSE), which is calculated as SSE divided by 2 * len_data:

MSE = SSE / (2 * len_data)

When taking the derivative of MSE with respect to the model parameters (theta), you get:

d(MSE)/d(theta) = (1/len_data) * (theta.dot(data_x.transpose()) - data_y.transpose()).dot(data_x)

Here, the division by 2 in the original SSE equation cancels out during the differentiation process, and the term 1/len_data helps scale the gradients appropriately to handle datasets of different sizes.
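
That cancellation can be checked numerically. A sketch with hypothetical toy data, comparing the analytic gradient of the 1/(2n)-scaled loss against a finite-difference estimate:

import numpy as np

data_x = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # toy dataset
data_y = np.array([2.0, 3.0, 5.0])
theta = np.array([0.5, 1.0])
n = len(data_y)

def half_mse(t):
    r = data_x @ t - data_y
    return np.sum(r ** 2) / (2 * n)

# Analytic gradient: the 1/2 cancels the 2 from the exponent,
# leaving (1/n) * X^T (X theta - y)
grad = data_x.T @ (data_x @ theta - data_y) / n

# Central finite-difference check
eps = 1e-6
fd = np.array([(half_mse(theta + eps * e) - half_mse(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.allclose(grad, fd))  # True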

Although the division by 2 * len_data doesn't directly impact the optimization process, it offers a more interpretable representation of the derivative of the cost function, which can be advantageous in certain contexts.

Whether to include the division by 2 * len_data in the cost function is a matter of preference. Some implementations choose to omit this division and adjust the learning rate accordingly to compensate for the scale change. However, to ensure consistency throughout the optimization process and related calculations, it's essential to make a deliberate choice and adhere to it.

So I don't believe your speculation is correct, and the implementation of the SSE stands. That's all from me.

@tianyizheng02 (Contributor)

This division is related to the derivative of the cost function during optimization, particularly when using gradient descent. By dividing the SSE by 2 * len_data, the derivative of the cost function (SSE) with respect to the model parameters (theta) becomes simpler and results in a cleaner expression.

It's perfectly fine to scale the loss function to make the math cleaner, but again the problem here is that this scaled version is no longer the SSE. SSE has a precise mathematical definition, so why should we call the scaled version the SSE when it's clearly not? Why not just call it scaled_sum_of_squared_errors or half_mean_squared_error?

Mean Squared Error (MSE), which is calculated as SSE divided by 2 * len_data

This is just factually wrong. MSE is defined to be SSE divided by len_data, with no division by 2. I pointed this out already in my previous comment, and you can easily verify this fact with Wikipedia or your search engine of choice:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i = 1}^{n} (Y_i - \hat{Y}_i)^2$$

While I have seen half of the MSE used in the context of ML (because, as you pointed out, it can make the math cleaner), we should definitely avoid misleading people with a different definition because this repo is for educational purposes.

Anyway, you missed the whole point of my issue. My whole point was that the output of the function sum_of_square_error is not the SSE. This is still absolutely true, and I've already demonstrated it to you using both code and math. Plus, the fact that you've been referring to the implementation's loss function as the MSE shows that you also acknowledge that it's not the SSE. If you prefer MSE (or half of the MSE) because it's more interpretable or more computationally elegant, then that's perfectly fine—I personally couldn't care less as to whether this implementation uses SSE or MSE. However, in order to not be incorrect or misleading, the function name should still be changed to reflect what it actually calculates.

@tianyizheng02 (Contributor)

Closing this PR since, as I've stated already, it doesn't actually resolve the linked issue.


Linked issue: machine_learning/linear_regression.py doesn't give optimal coefficients