Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How do you obtain the latest data from table 'bigquery-public-data.github_repos.languages' #117

Open
junluo-aspecta opened this issue Mar 21, 2024 · 2 comments
Assignees
Labels

Comments

@junluo-aspecta
Copy link

junluo-aspecta commented Mar 21, 2024

The table 'bigquery-public-data.github_repos.languages' was last updated in 2022 and contains just over 3 million rows in total. How do you obtain the latest data and ensure that the data statistics reflect the entire GitHub ecosystem?
20240321-110718

@madnight
Copy link
Owner

Hi @junluo-aspecta,

Yes, the bigquery-public-data.github_repos.languages include "only" 3 million repositories, but this sample size is statistically safe. For instance, in an election poll, they usually ask just a few thousand and scale that up to the whole population, e.g., the US population of 330 million. If you ask a few thousand, you have a quite big rate of error (a few percent), but if you ask 3 million, the error rate is extremely low.

Regarding the last updated in 2022, precisely Nov 27, 2022, well, that is an actual issue that I was not aware of yet. This means that I'm currently counting the Events correctly, but it does not include repositories after Nov 27, 2022. This alters the statistics significantly in the long run because we are only matching all the Events against a sample of 3M repos that were created before Nov 27, 2022. Hence, I have to find a new way to obtain a big enough sample size of repository language metadata that is up-to-date. Thanks for discovering and reporting this.

@madnight madnight added the bug label Mar 23, 2024
@madnight madnight self-assigned this Mar 23, 2024
@madnight
Copy link
Owner

Hi @junluo-aspecta,

Okay, I did some research, thought for a while, and came up with a new idea. We can extract language information directly from the GH Archive Events because they are stored in the PullRequest Events. This amounts to a large sample size (millions of repositories) and they are up-to-date since we can count the language from the PullRequest events of the current quarter. The issue is that with this approach, we ignore any repository that has not seen any PullRequest over the last quarter (also not from any kind of bot such as Dependabot). I think it is a fair trade-off for now until we can maybe come up with a better idea.

f8adb52

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants