
disallow_domains #2376

Open
mohmad-null opened this issue Nov 2, 2016 · 7 comments · May be fixed by #5922

Comments

@mohmad-null

Feature request
Currently there's "allowed_domains" to create a whitelist of domains to scrape.

It would be good if there was a "disallowed_domains" or "blocked_domains" as well. I appreciate I could probably do this in middleware, but I figure it's something quite a few people would want.
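Something along these lines, mirroring allowed_domains (the attribute name and behaviour below are just an illustration of the idea, not an existing Scrapy API):

import scrapy

class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]

    # Existing behaviour: only links into these domains are followed.
    allowed_domains = ["example.com"]

    # Proposed: never follow links into these domains (hypothetical attribute).
    disallowed_domains = ["facebook.com", "yahoo.com", "google.com"]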

@djunzu
Contributor

djunzu commented Nov 2, 2016

Not against the feature, but one problem with disallowed_domains is that your spider could end up crawling the entire internet. Simple example: there is a link to yahoo.com in some page you crawled and it is not in disallowed_domains. Result: your spider could be crawling websites you don't want for days.

@kmike
Member

kmike commented Nov 2, 2016

I had the same thoughts as @djunzu. What is your use case @mohmad-null?

@mohmad-null
Author

I'm doing some general crawling down to a depth of 3 or 4, starting at various seed sites but not limiting it to certain domains.

I don't want to crawl certain big sites (like facebook, yahoo, google, etc.) that everyone seems to point at, only "smaller" sites.
For this scenario a blacklist is much more useful than a whitelist. I've implemented it in middleware for now, but it'd still be a nice-to-have.

@redapple
Contributor

redapple commented Nov 7, 2016

@mohmad-null, would you want to share your middleware by any chance?

@mohmad-null
Author

mohmad-null commented Nov 10, 2016

@redapple: Sure. I have a global variable BLOCKED_DOMAINS set elsewhere like this:
BLOCKED_DOMAINS = open(r'\\path\to\file\blocked_domains.txt').read().splitlines()

Which reads from a file that is basically just a list of domains:

yahoo.com
google.com
facebook.com
....

And then the middleware class looks like this:

from scrapy.exceptions import IgnoreRequest


class BlockedDomains(object):

    # Drop requests to bad domains that are full of stuff we don't want.
    def process_request(self, request, spider):
        url = request.url.lower()
        for domain in BLOCKED_DOMAINS:
            if domain in url:
                raise IgnoreRequest

There are probably better (and almost certainly more optimised) ways to do this.
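For instance, a tighter variant (a sketch only, building on the snippet above) would match on the parsed hostname rather than on a substring of the whole URL, so a blocked domain appearing in a path or query string doesn't trigger a false positive:

from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

BLOCKED_DOMAINS = set(open(r'\\path\to\file\blocked_domains.txt').read().splitlines())


class BlockedDomains(object):

    def process_request(self, request, spider):
        host = urlparse(request.url).hostname or ''
        # Block exact matches and any subdomain of a blocked domain.
        if any(host == d or host.endswith('.' + d) for d in BLOCKED_DOMAINS):
            raise IgnoreRequest('Blocked domain: %s' % host)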

@felipeboffnunes
Member

I guess this can be done by adding logic to the get_host_regex method in scrapy.spidermiddlewares.offsite.OffsiteMiddleware. What do you think @Gallaecio?

@Gallaecio
Member

Yes, that's the middleware where this would ideally be implemented.
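A rough sketch of how that could look (the denied_domains attribute name is an assumption here; the linked PR may use a different spelling): subclass OffsiteMiddleware so requests to listed domains are treated as offsite, on top of the usual allowed_domains check.

from urllib.parse import urlparse

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware


class DenyDomainsOffsiteMiddleware(OffsiteMiddleware):

    def spider_opened(self, spider):
        super().spider_opened(spider)
        # Hypothetical spider attribute listing domains that must never be followed.
        self.denied_domains = set(getattr(spider, 'denied_domains', None) or [])

    def should_follow(self, request, spider):
        host = urlparse(request.url).hostname or ''
        if any(host == d or host.endswith('.' + d) for d in self.denied_domains):
            return False
        return super().should_follow(request, spider)

Such a subclass would be enabled via SPIDER_MIDDLEWARES in place of the built-in entry.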

felipecustodio linked a pull request May 5, 2023 that will close this issue