common-crawl

Here are 39 public repositories matching this topic...

commoncrawl / cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

statistics commoncrawl common-crawl

Updated Jun 3, 2024
Python

cisnlp / GlotCC

GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

crawler multlingual corpus-linguistics glot language-identification commoncrawl common-crawl glotcc multilingual-dataset

Updated May 31, 2024

ilyankou / cc-gpx

CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl

gpx hiking common-crawl

Updated May 29, 2024
Jupyter Notebook

ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc. 🦖

html parser json information-retrieval csv string simd dataset string-manipulation sorting-algorithms beautifulsoup pattern-recognition ndjson substring string-matching string-search string-parsing common-crawl laion

Updated May 18, 2024
C++

commoncrawl / cc-webgraph

Tools to construct and process webgraphs from Common Crawl data

pagerank webgraph commoncrawl common-crawl centrality-measures webgraph-framework

Updated May 2, 2024
Java

toimik / CommonCrawl

Common Crawl's processing tools

warc wat wet commoncrawl common-crawl warc-files wat-files common-crawl-data wet-files

Updated May 2, 2024
C#

commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark

spark pyspark sparksql wet commoncrawl common-crawl warc-files wat-files

Updated Apr 8, 2024
Python

crissyfield / troll-a

Drill into WARC web archives

security internet-archive command-line-tool warc security-tools common-crawl

Updated Jan 4, 2024
Go

oscar-project / ungoliant

🕷️ The pipeline for the OSCAR corpus

nlp crawler corpus-linguistics fasttext oscar commoncrawl common-crawl language-classification

Updated Dec 18, 2023
Rust

commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC

crawler news web-crawler apache-storm warc commoncrawl common-crawl storm-crawler

Updated Dec 13, 2023
Java

oscar-project / oscar-website

The website of the Oscar Project

nlp website machine-learning hugo language-model common-crawl

Updated Nov 9, 2023
TeX

connor-marchand / gau-python

This library gets URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. Inspired by Corbin Leo's gau.

scraper wayback-machine alienvault common-crawl gau-python

Updated Jul 22, 2023
Python

neil-zt / common-crawl-client

A Common Crawl client example for scraping specific websites.

common-crawl scraping-python comcrawl

Updated Jun 27, 2023
Jupyter Notebook

michaelharms / comcrawl

A Python utility for downloading Common Crawl data

python data deep-learning scraping commoncrawl common-crawl training-dataset

Updated Jun 8, 2023
Python

mwoss / mors

Application of topic models for information retrieval and search engine optimization.

python search search-engine crawler django scrapy gensim lda tfidf hacktoberfest doc2vec common-crawl

Updated Oct 24, 2022
Python

commoncrawl / cc-notebooks

Various Jupyter notebooks about Common Crawl data

jupyter-notebook aws-athena commoncrawl common-crawl webarchiving webgraph-framework

Updated Jun 2, 2022
Jupyter Notebook

hadrianw / abracabra

Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.

rust search-engine rust-lang adblock warc adblocking common-crawl

Updated May 31, 2022
Rust

bminixhofer / gerpt2

German small and large versions of GPT2.

nlp machine-learning german language-model common-crawl gpt2

Updated May 11, 2022
Python

Dahouabdelhalim / Discourse-marksers-and-Web-crawling

Discourse marker identification in the French language

deep-learning dataset web-crawling unitexgramlab common-crawl french-language discourse-markers

Updated Mar 5, 2022
HTML

IBM / cc-dbp

A dataset for knowledge base population research using Common Crawl and DBpedia.

dbpedia common-crawl ibm-research-ai knowledge-base-population

Updated Jan 27, 2022
Java
