disallow_domains #2376
Not against the feature, but one problem with
I had the same thoughts as @djunzu. What is your use case, @mohmad-null?
I'm doing some general crawling down to a depth of 3 or 4, starting at various seed sites but not limiting the crawl to certain domains. I don't want to crawl certain big sites (like facebook, yahoo, google, etc.) that everyone seems to point at, only "smaller" sites.
@mohmad-null, would you want to share your middleware by any chance?
@redapple: Sure. I have a global variable BLOCKED_DOMAINS set elsewhere like this, which reads from a file that is basically just a list of URLs:
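The original snippet was lost when this page was archived; the following is a minimal sketch of what such a loader could look like. The file name `blocked_domains.txt`, the function name, and the one-domain-per-line format are all assumptions, not the author's actual code.

```python
def load_blocked_domains(path):
    """Read one domain per line into a set, for fast membership tests.

    Blank lines are skipped and entries are lowercased, assuming the
    file is a plain newline-separated domain list.
    """
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}
```

A set is used rather than a list so that each lookup during the crawl is O(1) instead of a linear scan.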
And then the middleware class looks like this:
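The middleware class itself was also lost in extraction. Below is a hedged reconstruction of how such a Scrapy downloader middleware is commonly written: the class name, the `BLOCKED_DOMAINS` contents, and the subdomain-matching rule are assumptions, not the author's original implementation.

```python
from urllib.parse import urlparse

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:  # allow running this sketch without Scrapy installed
    class IgnoreRequest(Exception):
        pass

# Placeholder blocklist; in the author's setup this was loaded from a file.
BLOCKED_DOMAINS = {"facebook.com", "yahoo.com", "google.com"}


class BlockedDomainsMiddleware:
    """Downloader middleware that drops requests to blocklisted domains."""

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc.lower()
        # Match the bare domain and any subdomain (e.g. www.facebook.com).
        if any(domain == d or domain.endswith("." + d) for d in BLOCKED_DOMAINS):
            raise IgnoreRequest("Blocked domain: %s" % domain)
        return None  # let the request proceed through the remaining middlewares
```

Raising `IgnoreRequest` from `process_request` is the standard Scrapy way to silently drop a request before it is downloaded; the middleware would then be enabled via the `DOWNLOADER_MIDDLEWARES` setting.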
There are probably better (and almost certainly more optimised) ways to do this.
I guess this can be done by adding logic to
Yes, that's the middleware where this would be ideally implemented.
Feature request
Currently there's "allowed_domains" to create a whitelist of domains to scrape.
It would be good if there was a "disallowed_domains" or "blocked_domains" option as well. I appreciate I could probably do this in middleware, but I figure it's something quite a few people would want.
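To make the request concrete: usage would presumably mirror the existing `allowed_domains` spider attribute. This is a purely hypothetical sketch of the proposed API; `disallowed_domains` is not a real Scrapy attribute as of this issue.

```python
# Hypothetical API sketch: "disallowed_domains" does NOT exist in Scrapy;
# it is shown here only to mirror how "allowed_domains" is declared today.
class BroadCrawlSpider:  # would subclass scrapy.Spider in a real project
    name = "broad_crawl"
    start_urls = ["http://example.com"]
    # Proposed: a blocklist counterpart to the allowed_domains allowlist.
    disallowed_domains = ["facebook.com", "yahoo.com", "google.com"]
```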