Skip to content

通过知识获得解放,通过技术获得自由

Menu
  • 专题目录
  • 液压相关
    • 液压技术
    • 液压相邻技术
    • 液压应用
  • 计算机相关
    • 计算机和软件
    • 网络和网站技术
  • 哲学
  • 关于本站
Menu

Nginx反爬虫[0]

Posted on 2023年11月14日2025年5月6日 by

本篇是lnmp加固中修改denyrobots.conf配置。

 

网站反爬虫robots的原因
1)不遵守规范的爬虫会影响网站的正常使用
2)网站上的数据是网站的重要资产
3)爬虫对网站的爬取会造成网站统计数据的污染

 

  • 设置步骤

站点配置文件因为user-agent带有Bytespider爬虫标记,这可以通过Nginx规则来限定流氓爬虫的访问,直接返回403错误。

#forbidden Scrapy
  if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
  }

#forbidden UA
  if ($http_user_agent ~ "Bytespider|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|python-requests|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" ) 
  {
    return 403;
  }

  if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot")
    {
        return 403;
    } 
  
#forbidden not GET|HEAD|POST method access
  if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
  }

 

  • 测试

在第三方vps上测试

1)第一次测试——模仿奇虎访问http

curl -I -A 'qihoobot' www.fanlog.org

主要原因是采用nginx stream导致

2)第二次测试——模仿奇虎访问https

curl -I -A 'qihoobot' https://www.fanlog.org

3)第三次测试——浏览器标识空

curl -I -A ' ' https://www.fanlog.org

4)第四次测试——正常chrome访问

curl -I -A 'chrome' https://www.fanlog.org

5)第五次测试-解决api.w.org问题

curl -I -A 'chrome' https://www.fanlog.org

 

补充说明:

1)UI收集

FeedDemon             内容采集
BOT/0.1 (BOT for JCE) sql注入
CrawlDaddy            sql注入
Java                  内容采集
Jullo                 内容采集
Feedly                内容采集
UniversalFeedParser   内容采集
ApacheBench           cc攻击器
Swiftbot              无用爬虫
YandexBot             无用爬虫
AhrefsBot             无用爬虫
YisouSpider           无用爬虫(已被UC神马搜索收购,此蜘蛛可以放开!)
jikeSpider            无用爬虫
MJ12bot               无用爬虫
ZmEu phpmyadmin       漏洞扫描
WinHttp               采集cc攻击
EasouSpider           无用爬虫
HttpClient            tcp攻击
Microsoft URL Control 扫描
YYSpider              无用爬虫
jaunty                wordpress爆破扫描器
oBot                  无用爬虫
Python-urllib         内容采集
Python-requests       内容采集
Indy Library          扫描
FlightDeckReports Bot 无用爬虫
Linguee Bot           无用爬虫

 

参考资料:

  1. https://cloud.tencent.com/developer/article/1616186
  2. https://www.rmnof.com/article/block-search-indexing/
  3. http://www.rrdaj.com/hzseo/seoxin-shou-ru-men-xue-xi/4164.html
  4. https://blog.csdn.net/weixin_43507959/article/details/106881315
  5. https://www.cnblogs.com/xiao987334176/p/12559101.html

欢迎回来

希望本站对你有所帮助!

如有疑问请联系info@fanlog.org
2023 年 11 月
一二三四五六日
 12345
6789101112
13141516171819
20212223242526
27282930 
« 6 月    

AI辅助 (17)

© 2025 | Powered by Superbs Personal Blog theme
We use cookies on our website to give you the most relevant experience by remembering your preferences and repeat visits. By clicking “Accept All”, you consent to the use of ALL the cookies. However, you may visit "Cookie Settings" to provide a controlled consent.
Cookie SettingsAccept All
Manage consent

Privacy Overview

This website uses cookies to improve your experience while you navigate through the website. Out of these, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may affect your browsing experience.
Necessary
Always Enabled
Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.
Non-necessary
Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.
SAVE & ACCEPT