
disallow_domains #2376

Open
mohmad-null opened this issue Nov 2, 2016 · 7 comments · May be fixed by #5922

Comments

@mohmad-null

Feature request
Currently there's "allowed_domains" to create a whitelist of domains to scrape.

It would be good if there was a "disallowed_domains" or "blocked_domains" as well. I appreciate I could probably do this in middleware, but I figure it's something quite a few people would want.
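Something along these lines, mirroring allowed_domains (the attribute name and behaviour below are just an illustration of the idea, not an existing Scrapy API):

import scrapy

class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]

    # Existing behaviour: only links into these domains are followed.
    allowed_domains = ["example.com"]

    # Proposed: never follow links into these domains (hypothetical attribute).
    disallowed_domains = ["facebook.com", "yahoo.com", "google.com"]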

@djunzu
Contributor

djunzu commented Nov 2, 2016

Not against the feature, but one problem with disallowed_domains is that your spider could end up crawling the entire internet. Simple example: there is a link to yahoo.com in some page you crawled and it is not in disallowed_domains. Result: your spider could be crawling websites you don't want for days.

@kmike
Member

kmike commented Nov 2, 2016

I had the same thoughts as @djunzu. What is your use case @mohmad-null?

@mohmad-null
Author

I'm doing some general crawling down to a depth of 3 or 4, starting at various seed sites but not limiting it to certain domains.

I don't want to crawl certain big sites (like facebook, yahoo, google, etc.) that everyone seems to point at, only "smaller" sites.
For this scenario a blacklist is much more useful than a whitelist. I've implemented it in middleware for now, but it'd still be a nice-to-have.

@redapple
Contributor

redapple commented Nov 7, 2016

@mohmad-null, would you want to share your middleware by any chance?

@mohmad-null
Author

mohmad-null commented Nov 10, 2016

@redapple: Sure. I have a global variable BLOCKED_DOMAINS set elsewhere like this:
BLOCKED_DOMAINS = open(r'\\path\to\file\blocked_domains.txt').read().splitlines()

Which reads from a file that is basically just a list of domains:

yahoo.com
google.com
facebook.com
....

And then the middleware class looks like this:

from scrapy.exceptions import IgnoreRequest


class BlockedDomains(object):

    # Drop requests to bad domains that are full of stuff we don't want.
    def process_request(self, request, spider):
        url = request.url.lower()
        for domain in BLOCKED_DOMAINS:
            if domain in url:
                raise IgnoreRequest

There are probably better (and almost certainly more optimised) ways to do this.
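For instance, a tighter variant (a sketch only, building on the snippet above) would match on the parsed hostname rather than on a substring of the whole URL, so a blocked domain appearing in a path or query string doesn't trigger a false positive:

from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

BLOCKED_DOMAINS = set(open(r'\\path\to\file\blocked_domains.txt').read().splitlines())


class BlockedDomains(object):

    def process_request(self, request, spider):
        host = urlparse(request.url).hostname or ''
        # Block exact matches and any subdomain of a blocked domain.
        if any(host == d or host.endswith('.' + d) for d in BLOCKED_DOMAINS):
            raise IgnoreRequest('Blocked domain: %s' % host)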

@felipeboffnunes
Member

I guess this can be done by adding logic to the get_host_regex method in scrapy.spidermiddlewares.offsite.OffsiteMiddleware. What do you think @Gallaecio?

@Gallaecio
Member

Yes, that's the middleware where this would ideally be implemented.
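A rough sketch of how that could look (the denied_domains attribute name is an assumption here; the linked PR may use a different spelling): subclass OffsiteMiddleware so requests to listed domains are treated as offsite, on top of the usual allowed_domains check.

from urllib.parse import urlparse

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware


class DenyDomainsOffsiteMiddleware(OffsiteMiddleware):

    def spider_opened(self, spider):
        super().spider_opened(spider)
        # Hypothetical spider attribute listing domains that must never be followed.
        self.denied_domains = set(getattr(spider, 'denied_domains', None) or [])

    def should_follow(self, request, spider):
        host = urlparse(request.url).hostname or ''
        if any(host == d or host.endswith('.' + d) for d in self.denied_domains):
            return False
        return super().should_follow(request, spider)

Such a subclass would be enabled via SPIDER_MIDDLEWARES in place of the built-in entry.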

felipecustodio linked a pull request May 5, 2023 that will close this issue