The table 'bigquery-public-data.github_repos.languages' was last updated in 2022 and contains just over 3 million rows in total. How do you obtain the latest data and ensure that the data statistics reflect the entire GitHub ecosystem?
Yes, the bigquery-public-data.github_repos.languages table includes "only" 3 million repositories, but a sample of that size is statistically safe. For instance, election polls usually ask just a few thousand people and scale the result up to the whole population, e.g., the US population of 330 million. If you ask a few thousand, you get a fairly large margin of error (a few percent), but if you ask 3 million, the margin of error is extremely low.
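To put rough numbers on that comparison, here is a minimal sketch using the standard margin-of-error formula for a proportion at 95% confidence. It assumes a simple random sample and the worst case of a 50% share, which may not hold exactly for the BigQuery table; `margin_of_error` is just an illustrative helper name.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion.

    Assumes a simple random sample; z=1.96 corresponds to 95% confidence
    and proportion=0.5 is the worst case (largest margin).
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Compare a typical poll size with the 3M-row languages table.
for n in (1_000, 3_000_000):
    print(f"n={n:>9,}: +/- {margin_of_error(n) * 100:.3f} percentage points")
```

Under these assumptions a poll of 1,000 comes out at roughly 3 percentage points, while 3 million rows come out at well under 0.1 percentage points.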
Regarding the last update in 2022 (Nov 27, 2022, to be precise): that is a real issue that I was not aware of. It means that I'm currently counting the Events correctly, but repositories created after Nov 27, 2022 are not included. In the long run this skews the statistics significantly, because all Events are matched only against a sample of 3M repos that were created before that date. Hence, I have to find a new way to obtain a large enough sample of repository language metadata that is up-to-date. Thanks for discovering and reporting this.
Okay, I did some research, thought about it for a while, and came up with a new idea. We can extract the language information directly from the GH Archive Events, because it is stored in the PullRequest Events. This gives a large sample (millions of repositories), and it is up-to-date, since we can count the languages from the PullRequest events of the current quarter. The drawback is that this approach ignores any repository that has not seen a single PullRequest over the last quarter (not even from a bot such as Dependabot). I think that is a fair trade-off for now, until we can maybe come up with a better idea.
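As a rough sketch of the idea (not the actual implementation), the following Python reads GH Archive events as JSON lines and counts each repository once under the language reported in its PullRequest events. It assumes the language sits at `payload.pull_request.base.repo.language` and that the raw hourly JSON files are used, where `payload` is a nested object; the function name `count_repo_languages` is made up for illustration.

```python
import json
from collections import Counter


def count_repo_languages(event_lines):
    """Count distinct repositories per language from GH Archive JSON lines.

    Assumption: PullRequestEvent payloads carry the repository language at
    payload.pull_request.base.repo.language.
    """
    repo_language = {}
    for line in event_lines:
        event = json.loads(line)
        if event.get("type") != "PullRequestEvent":
            continue
        # Walk the nested payload defensively; any level may be missing or null.
        node = event
        for key in ("payload", "pull_request", "base", "repo"):
            node = (node or {}).get(key)
        repo = node or {}
        name, language = repo.get("full_name"), repo.get("language")
        if name and language:
            # Keep one entry per repository so busy repos are not overcounted.
            repo_language[name] = language
    return Counter(repo_language.values())
```

Deduplicating by repository name means a repository with hundreds of pull requests weighs the same as one with a single pull request; counting events instead of repositories would be a different metric.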