How do you obtain the latest data from table 'bigquery-public-data.github_repos.languages' #117

junluo-aspecta · 2024-03-21T02:07:22Z

The table 'bigquery-public-data.github_repos.languages' was last updated in 2022 and contains just over 3 million rows in total. How do you obtain the latest data and ensure that the data statistics reflect the entire GitHub ecosystem?

madnight · 2024-03-23T20:42:50Z

Hi @junluo-aspecta,

Yes, bigquery-public-data.github_repos.languages includes "only" 3 million repositories, but a sample of that size is statistically safe. For instance, election pollsters usually ask just a few thousand people and scale the result up to the whole population, e.g., the US population of 330 million. With a few thousand respondents, the margin of error is fairly large (a few percent); with 3 million, it is extremely low.
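To put rough numbers on the poll comparison, here is a minimal sketch using the standard normal-approximation margin of error for a proportion. It assumes a simple random sample, a 95% confidence level (z = 1.96), and the worst-case proportion of 0.5; the function name and the poll size of 2,000 are illustrative choices, not something taken from the project.

```python
import math


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion.

    Assumes a simple random sample; z = 1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


# Hypothetical poll of 2,000 people versus a sample of 3 million repositories.
poll_error = margin_of_error(2_000)
repo_error = margin_of_error(3_000_000)

print(f"poll of 2,000:        +/- {poll_error:.2%}")
print(f"sample of 3,000,000:  +/- {repo_error:.3%}")
```

Under these assumptions the poll comes out at about 2.2 percentage points and the 3 million sample at under 0.06. The estimate only covers random sampling error; it says nothing about bias if the sample is not representative.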

As for the table having last been updated in 2022 (on Nov 27, 2022, to be precise): that is a real issue that I was not aware of. It means that I'm currently counting the Events correctly, but the counts do not include repositories created after Nov 27, 2022. This skews the statistics significantly in the long run, because all Events are matched against a sample of 3M repos that were created before that date. Hence, I have to find a new way to obtain a large enough, up-to-date sample of repository language metadata. Thanks for discovering and reporting this.

madnight · 2024-03-30T09:12:46Z

Hi @junluo-aspecta,

Okay, I did some research, thought about it for a while, and came up with a new idea. We can extract language information directly from the GH Archive Events, because it is stored in the PullRequest Events. This gives us a large sample (millions of repositories) that is also up-to-date, since we can take the language from the PullRequest events of the current quarter. The downside is that this approach ignores any repository that has not seen a PullRequest over the last quarter (not even one from a bot such as Dependabot). I think it is a fair trade-off for now, until we come up with a better idea.
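As a rough illustration of the idea (not the code from the commit), the sketch below builds a repository-to-language lookup from already-parsed event records. It assumes each PullRequest event carries the language at `payload.pull_request.base.repo.language`, which should be checked against the actual GH Archive data; the function name and the sample events are made up for the example.

```python
from __future__ import annotations

from collections import Counter
from typing import Any, Iterable, Mapping


def languages_from_pull_requests(events: Iterable[Mapping[str, Any]]) -> dict[str, str]:
    """Map repository name -> language, using only PullRequestEvent records.

    Assumed payload shape: payload.pull_request.base.repo.language.
    Repositories without a PullRequest event, or without a language, are skipped.
    """
    repo_language: dict[str, str] = {}
    for event in events:
        if event.get("type") != "PullRequestEvent":
            continue
        pull_request = (event.get("payload") or {}).get("pull_request") or {}
        base_repo = (pull_request.get("base") or {}).get("repo") or {}
        name = base_repo.get("full_name") or (event.get("repo") or {}).get("name")
        language = base_repo.get("language")
        if name and language:
            repo_language[name] = language  # a later event overwrites an earlier one
    return repo_language


def pr_event(repo: str, language: str | None) -> dict[str, Any]:
    """Build a tiny fake PullRequestEvent for the demonstration below."""
    base_repo = {"full_name": repo, "language": language}
    return {
        "type": "PullRequestEvent",
        "repo": {"name": repo},
        "payload": {"pull_request": {"base": {"repo": base_repo}}},
    }


# Made-up sample data: two PRs on one repo, one on another, one without a
# language, and one non-PR event that must be ignored.
sample_events = [
    pr_event("octo/app", "Go"),
    pr_event("octo/app", "Go"),
    pr_event("octo/lib", "Rust"),
    pr_event("octo/docs", None),
    {"type": "PushEvent", "repo": {"name": "octo/tools"}, "payload": {}},
]

repo_language = languages_from_pull_requests(sample_events)
repos_per_language = Counter(repo_language.values())
print(repo_language)
print(repos_per_language)
```

The lookup plays the role the stale languages table had before: other events from the same quarter can be matched against it by repository name. The sample also shows the trade-off, since `octo/tools` never appears in the lookup because it has no PullRequest event.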

f8adb52

madnight added the bug label Mar 23, 2024

madnight self-assigned this Mar 23, 2024
