Remove redundant whitespace from pdf text #440

Open
wants to merge 1 commit into base: main

Conversation

seocahill

The token savings seem to be significant, AFAICT:

[13] pry(main)> chunk = pages[3].text
=> "                                       INTRODUCTION\n\n\n\n   THEIRISH LANGUAGE\n\n   Irish is one of the many languages spoken aross Europe and as far east as India, that trace\ntheir descent from Indo-European, a hypothetical ancestor-language thought to have been\nspoken more than 4,500 years ago. Irish belongs to the Celtic branch of the Indo-European\nfamily, as the diagram below shows. It and three other members of this branch - Scottish\nGaelic,Welsh and Breton - are today alive ascommunity languages.\n\n   The form of Celtic that was to become Irish was brought to Ireland by the invading Gaels\n- about300 B.C. according to some scholars. Later it spread to Scotland and the Isle of Man.\nScottish Gaelic and Manx gradually separated from Irish (and,more slowly,from each other),\nand they can be thought of as distinct languages from the seventeenth century onwards. The\nterm 'Gaelic* may beused todenote all three.\n\n\n   It appears that the early Irish learned the art of writing at about the timeof their conversion\nto Christianity, in the fifth century. After that, the language can be seen to go throughfour\nstages of continuous historical development, asfar as its written form is concerned: Old Irish\n(approximately A.D. 600- 900), Middle Irish (c. 900- 1200), Early Modern Irish (c. 1200 -\n1650), and Modern Irish. Throughout this development Irish borrowed words from other\nlanguages it came into contact with (pre-eminently from Latin, from Norse, from Anglo-\nNorman (a dialect of French), andfrom English.\n\n   From the earliest times Irish has been cultivated for literature and learning. 
It in fact\npossesses one of the oldest literatures in Europe.\n\n\n\n                                       INDOEUROPEAN\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n                                                                                          Sanskrit\n                                                                                          Hindi\n                                                                                          Urdu\n                                                                                          Bengali\n                                                                                          Punjabi\n               GERMANIC\n                                                                                          Singhalese\n                                         CELTIC             Latin, etc.                     etc.\n       Icelandic     English\n       Faroese       Frisian                                 Portuguese\n       Norwegian     Dutch                       \\\n       Swedish       German        Irish                     Spanish\n                                   Scottish-     Welsh       Catalan\n       Danish          etc.          Gaelic      Breton      French\n         etc.                      Manx          Cornish     Italian\n                                                             Rumanian\n                                                               etc."
[14] pry(main)> enc.encode(chunk).length
=> 498
[15] pry(main)> enc.encode(chunk.gsub(/\s+/, ' ')).length
=> 415
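
For reference, the two counts above work out to 83 fewer tokens, roughly a 16.7% reduction for this page. Below is a minimal, dependency-free sketch of the same substitution. `collapse_whitespace` and `percent_saved` are hypothetical helper names used only for illustration, not part of the gem, and the token counts are the ones reported in the pry session rather than recomputed here.

```ruby
# Hypothetical helper: collapse every run of whitespace
# (spaces, tabs, newlines) into a single space.
def collapse_whitespace(text)
  text.gsub(/\s+/, ' ')
end

# Hypothetical helper: relative saving between two token counts, in percent.
def percent_saved(before, after)
  ((before - after) * 100.0 / before).round(1)
end

sample = "   THE IRISH LANGUAGE\n\n   Irish is one of\nthe many languages"
puts collapse_whitespace(sample).inspect

# Token counts taken from the pry session above (498 before, 415 after).
puts percent_saved(498, 415)
```

Note that `\s+` also matches newlines, so paragraph and line breaks are flattened along with the padding.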

Replace each run of whitespace with
a single space when parsing
PDF text.
Author

@seocahill seocahill left a comment


...but I'm very new to this. Thanks for the gem btw - great work 👍

@andreibondarev
Collaborator

@seocahill I'm wondering if removing extra whitespace and newlines should be the responsibility of a chunker rather than the processors?
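
One consideration behind that question: a blanket `gsub(/\s+/, ' ')` in the processor also removes the blank lines that a downstream chunker might use as paragraph boundaries. Below is a rough sketch of a gentler normalization that squeezes padding but keeps paragraph breaks. `normalize_keeping_paragraphs` is a hypothetical name, and nothing here reflects the gem's actual processor or chunker API.

```ruby
# Hypothetical helper: squeeze whitespace inside each paragraph,
# but keep one blank line between paragraphs.
def normalize_keeping_paragraphs(text)
  text
    .split(/\n\s*\n/)                           # blank lines mark paragraph boundaries
    .map { |para| para.gsub(/\s+/, ' ').strip } # collapse padding and single line breaks
    .reject(&:empty?)                           # drop paragraphs that were only whitespace
    .join("\n\n")
end

raw = "      INTRODUCTION\n\n\n\n   Irish is one of\nthe many languages\n\n   The form of Celtic"
puts normalize_keeping_paragraphs(raw)
```

Whether this belongs in a processor or a chunker is still the open design question. The sketch only shows that the token saving and the paragraph structure do not have to be traded against each other.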
