Extract the content of the Witcher EPUB books, filter it with rule-based Beautiful Soup parsing, and export a txt file suitable for GPT-2 fine-tuning.

thetobysiu/witcher-books-processing

Intro

Extract text from EPUB files with the Beautiful Soup parser and filter out content such as quotes and other material not related to the Witcher.

Build a dictionary structure that separates the text into books, chapters, and sentences.

Output a txt file with the GPT-2 <|endoftext|> token prepended to each chapter/book.
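The nested structure and the export step can be sketched roughly as follows. This is a minimal illustration, not the code in main.py or parse.py: the `books` dictionary layout, the sample sentences, and the `build_corpus` and `write_corpus` helpers are all assumptions, and the sketch places the token before every chapter.

```python
from pathlib import Path

END_OF_TEXT = "<|endoftext|>"  # GPT-2 document separator token

# Hypothetical layout: book title -> chapter title -> list of sentences.
books = {
    "Book One": {
        "Chapter 1": ["First sentence.", "Second sentence."],
        "Chapter 2": ["Another sentence."],
    },
    "Book Two": {
        "Chapter 1": ["A sentence from the second book."],
    },
}


def build_corpus(books):
    """Join each chapter's sentences and prepend the token to every chapter."""
    parts = []
    for chapters in books.values():
        for sentences in chapters.values():
            parts.append(END_OF_TEXT + "\n" + " ".join(sentences))
    return "\n".join(parts) + "\n"


def write_corpus(text, path="witcher.txt"):
    """Write the corpus to a UTF-8 text file (assumed output name: witcher.txt)."""
    Path(path).write_text(text, encoding="utf-8")


corpus = build_corpus(books)
```

Calling `write_corpus(corpus)` would then produce a single text file in which every chapter starts with the separator token, which is the shape GPT-2 fine-tuning scripts commonly expect.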

Steps

Run main.py to create witcher.txt.

parse.py is the module that parses the books. It reads the EPUBs inside the books/ folder; the EPUBs must be uncompressed (unpacked into folders) and unmodified originals.

A Jupyter notebook is also available.
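As a rough sketch of what reading uncompressed EPUBs from books/ could look like, the snippet below collects the HTML/XHTML content files of each unpacked book folder. The `find_chapter_files` helper and the one-folder-per-book layout are assumptions for illustration; parse.py may locate its files differently.

```python
from pathlib import Path

# File extensions that unpacked EPUBs typically use for chapter content.
CONTENT_SUFFIXES = {".html", ".xhtml", ".htm"}


def find_chapter_files(books_dir="books"):
    """Map each unpacked book folder name to its sorted content files."""
    root = Path(books_dir)
    if not root.is_dir():
        return {}
    result = {}
    for book_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        result[book_dir.name] = sorted(
            f for f in book_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in CONTENT_SUFFIXES
        )
    return result
```

Files are returned in alphabetical path order, which may not match the reading order declared in an EPUB's package document, so a real parser might need to consult that file instead.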
