Search engines choking on phpBB

         

Kendo

11:43 pm on Jul 25, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I switched our online discussion forum from an in-house solution to phpBB a few weeks ago, but have noticed a jump in hits over the last few days. For example:
- 2,155 users and 31,634 pages looks normal
- but 2,359 users and 115,419 pages is not normal

So I checked the logs and those hits are coming from Google, and there is not that much content on the forum, maybe 100 topics. I then checked links using Xenu and it completed with a list of 88,000 links!

I'll post at phpBB as well to see if they have encountered the same problem.

Kendo

1:25 am on Jul 26, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I stopped Xenu at 66% complete with 40,000+ links from only 49 posts.

Swanny007

4:09 am on Jul 26, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It is normal for search engines to have to re-crawl the entire site when making a major change like changing the software that runs the forum. There are tons of URLs to index and they will pick the canonical ones and then slow down.

Kendo

4:31 am on Jul 26, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In this case there are only 40 topics in the forum, which means only 49 pages.

phpBB is partly the problem because it creates unique IDs that are stored in cookies. But bots don't use cookies so every link is a new area to explore. The people at phpBB are working on a new version that should include a fix for that. I am hoping that they will also expand their IP Ban feature to include netblocks instead of just IP ranges.

Most caught in the loop were from Singapore posing as Google. Banning netblocks should keep them away.
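
For anyone wondering what netblock matching looks like in practice, here is a minimal sketch using Python's standard ipaddress module. The netblocks are reserved documentation ranges used as placeholders, not real offenders, and is_banned is a hypothetical helper, not anything that exists in phpBB.

```python
import ipaddress

# Placeholder netblocks (reserved documentation ranges), not real offenders.
BANNED_NETBLOCKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]

def is_banned(ip: str) -> bool:
    """Return True if the address falls inside any banned netblock."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BANNED_NETBLOCKS)

print(is_banned("203.0.113.77"))    # inside the /24
print(is_banned("198.51.100.200"))  # outside the /25, which only covers .0 to .127
```

One CIDR entry covers the whole block, so there is no need to work out start and end addresses by hand.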

Swanny007

4:46 am on Jul 26, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have used phpBB for 24 years. In the last few weeks/months I have seen a sharp increase in extreme bot activity. I have blocked all of Singapore, Brazil, and some other countries as needed. I highly recommend you use free Cloudflare to help get the overbearing bot traffic under control. You can use a "managed challenge" on urls like /forums/ucp.php* which helps slow down bots but not turn off too many legit users.

tangor

9:50 am on Jul 26, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Don't know phpBB so can't say for sure, but there are other software/platforms out there (such as WP) that generate multiple internal links, which can confuse search bots with duplicate content under numerous URLs. Does phpBB have a config setting to mitigate this scenario? Is there a canon directive in use?

thecoalman

3:25 pm on Jul 26, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Now I know what Kendo stands for. LOL For the benefit of others, this is the gist of what I posted on phpbb.com; I'm a moderator there.

phpBB issues a SID parameter on page load when a new session is started, so each link on the page has the SID appended. Since bots do not accept cookies, each page load starts a new session with a new SID appended, and you get an infinite number of "unique" URLs. I don't know why, but there are some features you can enable for guests that would require a session. SIDs in URLs are being completely removed in phpBB 4.
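
To see how much of a crawl is really the same page under different sessions, you can strip the session parameter and count what is left. This is only a sketch: the URLs are made up, strip_sid is a hypothetical helper, and it assumes the parameter appears in the query string as lowercase sid.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_sid(url: str) -> str:
    """Return the URL with any 'sid' query parameter removed."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "sid"]
    return urlunsplit(parts._replace(query=urlencode(query)))

# Hypothetical crawl output: one topic seen under three different sessions.
crawled = [
    "https://example.com/forums/viewtopic.php?t=12&sid=aaa111",
    "https://example.com/forums/viewtopic.php?t=12&sid=bbb222",
    "https://example.com/forums/viewtopic.php?t=12&sid=ccc333",
]
unique_pages = {strip_sid(u) for u in crawled}
print(len(crawled), "crawled URLs ->", len(unique_pages), "real page(s)")
```

Running an exported link list from a crawler through something like this gives a rough idea of the real page count behind the inflated total.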

As an aside, this can also cause performance issues with phpBB's session table. This table is purged based on session length, but if you have an unnamed bot making a thousand requests every minute it can be an issue.

phpBB has a bots group, and any bots added to this group have their own permissions like other user groups; you would set their permissions the same as guests or less. For example, if you have an off-topic section of your forum available to the public, you can block bots from it if you don't want it indexed. Adding a bot to the group also strips the SID from any links and hides certain links, e.g. login, search etc. If you are going to use a tool like Xenu you need to add its user agent to the bots group.

Most caught in the loop were from Singapore posing as Google.


As I mentioned on phpBB.com, I would not be surprised if they are using Google's user agent on phpBB forums to avoid getting the SID parameter. The funny thing is that's an improvement, because it makes them easily identifiable: they can no longer hide behind a common user agent. If you are using Cloudflare this gets blocked out of the box.
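
If you are not behind Cloudflare, a fake Googlebot can be spotted with a reverse DNS lookup followed by a forward lookup that has to come back to the same IP. The sketch below is a rough illustration, not a complete check: is_verified_googlebot is a hypothetical helper, the hostname suffixes are the ones I recall from Google's guidance and may not be exhaustive, and the default lookups only handle IPv4.

```python
import socket

# Assumed suffixes for genuine Googlebot hostnames; treat as illustrative.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def _reverse(ip):
    """Reverse DNS: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]

def _forward(host):
    """Forward DNS: hostname to list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]

def is_verified_googlebot(ip, reverse=_reverse, forward=_forward):
    """True only if reverse DNS names a Google host AND that host resolves back to ip."""
    try:
        host = reverse(ip)
        if not host.lower().endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:
        # No PTR record or failed lookup: treat as unverified.
        return False
```

The lookups are passed in as parameters so the logic can be exercised with canned answers instead of live DNS. A claimed Googlebot from a Singapore hosting range would fail at the suffix check.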

[edited by: thecoalman at 3:45 pm (utc) on Jul 26, 2025]

thecoalman

3:38 pm on Jul 26, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have used phpBB for 24 years. In the last few weeks/months I have seen a sharp increase in extreme bot activity. I have blocked all of Singapore, Brazil, and some other countries as needed. I highly recommend you use free Cloudflare to help get the overbearing bot traffic under control. You can use a "managed challenge" on urls like /forums/ucp.php* which helps slow down bots but not turn off too many legit users


They are banging away at ucp.php and other common links like that because they appear on every page load with a different SID appended.

I've been using CF for years. I have four custom rules; here is a snippet from a KB article I'm writing.


  • Go to Security >> Settings >> Bot traffic and enable "Bot fight mode". This will block some of the most malicious bots Cloudflare has identified. Optionally you can also enable "Block AI Bots", which will block AI scrapers that identify themselves, e.g. ChatGPT.
  • Next create some rules. Go to Security >> Security Rules >> Create new Rule >> New Custom Rule. CF has an easy-to-use interface; the following examples are the expressions generated by the GUI.


  • Rules are fired in order, so rule 1 is the whitelist, with the action Skip. This is primarily for bots you want to have access. CF maintains a list of verified bots that adhere to robots.txt, so you can add them. If you want to allow other bots not on that list, just add their user agents. Note the "or": a request is skipped if it is a verified bot or if it matches one of your listed user agents.
    (cf.client.bot or http.user_agent wildcard r"Somebot")

  • Rule 2 is for whatever you want to block outright. You can block using a variety of criteria like ASN, User Agent, Country etc. For this example we are blocking the country T1, which is the Tor network, and the continent of Antarctica. The two conditions are joined with "or" because a request only has to match one of them to be blocked.
    (ip.geoip.country eq "T1" or ip.src.continent eq "AN")

  • For rule 3 we will add a rule for problematic countries and for action will issue an interactive Challenge. The interactive challenge requires the user to perform some action on screen, usually a check box. In the following example it's issued to India and China.
    (ip.geoip.country eq "IN") or (ip.geoip.country eq "CN") 

  • For rule 4 we'll whitelist countries, and the action we'll use is JS Challenge, which is the brief "Checking your browser..." page. Countries listed here will not be challenged, so add your own country and the countries where you expect the bulk of your traffic to come from. Any country not listed here, assuming it wasn't blocked by any of the rules above, will be issued the JS Challenge. It's important to note you need to use the "Does not equal" operator with AND. In the following example the US, Canada and the UK are whitelisted.
    (ip.geoip.country ne "US" and ip.geoip.country ne "CA" and ip.geoip.country ne "GB")
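
The "Does not equal with AND" point trips people up, so here is a small illustration in Python of why OR does not work for a whitelist. The function names are made up for the example; the country codes are the three whitelisted in rule 4.

```python
def challenged_with_and(country: str) -> bool:
    # Mirrors the rule: not US and not CA and not GB
    return country != "US" and country != "CA" and country != "GB"

def challenged_with_or(country: str) -> bool:
    # Tempting but wrong: every country differs from at least one of the three codes
    return country != "US" or country != "CA" or country != "GB"

for c in ("US", "GB", "DE"):
    print(c, challenged_with_and(c), challenged_with_or(c))
```

With AND, only visitors outside all three countries are challenged. With OR, a US visitor is still challenged because "US" is not "CA", so the whitelist does nothing.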

    thecoalman

    3:40 pm on Jul 26, 2025 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Is there a canon directive in use?


    There is canonical url in meta tag on every page.

    Swanny007

    9:23 pm on Jul 26, 2025 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    After reading about the challenges, I don't use JS challenge anymore, just managed (interactive) challenge; it is superior. Thanks thecoalman, I excluded Canada & USA from my challenge as that is normally legit traffic, which will cause less friction for legitimate users. Well, I just did that so we will see.
    This is my exact expression, it may be right or wrong:
    (http.request.uri.path wildcard r"/forums/ucp.php*" and ip.src.country ne "CA" and ip.src.country ne "US") or (http.request.uri.path wildcard r"/forums/posting.php*" and ip.src.country ne "CA" and ip.src.country ne "US") or (http.request.uri.path wildcard r"/forums/search.php*" and ip.src.country ne "CA" and ip.src.country ne "US")

    Kendo

    9:39 pm on Jul 26, 2025 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Forums have always been a problem with crawlers because each page has many links (bells and whistles) that should be NOFOLLOW but it never works like that so for each page there can be several more links derived from or pointing to it. Scrapers like Google should have that sorted but the majority of them won't.

    phpBB is the only PHP solution running on our Windows server and we like it very much. So we are looking forward to seeing version 4.

    We looked at rate limiting and found that it has limitations: it kills the connection rather than regulating it, and anything added to web.config is absolute, in that it cannot refer to a database table. I have been working towards a solution for our Windows server and now have the incentive to continue thanks to these discussions :-)
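
To illustrate the difference between regulating and killing a connection, here is a minimal token-bucket sketch that reports how long a client should be made to wait rather than refusing the request. It is only an assumption about what "regulate" could mean here, not the solution being built; the Throttle class and its parameters are hypothetical.

```python
class Throttle:
    """Per-client token bucket that returns a delay instead of rejecting."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec   # tokens refilled per second
        self.burst = burst         # maximum tokens a client can bank
        self.state = {}            # client -> (tokens, last_timestamp)

    def delay_for(self, client: str, now: float) -> float:
        """Charge one request and return the seconds this client should wait."""
        tokens, last = self.state.get(client, (self.burst, now))
        # Refill for the time elapsed since the last request, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        tokens -= 1
        self.state[client] = (tokens, now)
        # A negative balance means the client is ahead of its allowance.
        return 0.0 if tokens >= 0 else -tokens / self.rate

throttle = Throttle(rate_per_sec=1.0, burst=2)
for _ in range(4):
    print(throttle.delay_for("203.0.113.9", now=0.0))
```

The timestamp is passed in so the behaviour can be checked without a real clock. The state dictionary could just as well be a database table, which is the part web.config cannot do.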

    thecoalman

    1:54 am on Jul 27, 2025 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    phpBB hides a lot of those unnecessary links from bots, assuming the bot has been added to the bots list. To see this for yourself, log out, clear your cookies and use a user agent switcher in the browser to send Google's user agent. Note you can't do this if the site is proxied through Cloudflare.

    Also, while logged in as admin, if you see a lot of guests online, click the heading "Who is online" and then "Display guests". Check the user agents to see whether any of them is a named bot, and add those to the list as you come across them.

    Note that phpBB is in the process of fully switching to Twig syntax so you'll find a mix of Twig syntax and phpBB's custom syntax in templates. If you look through the templates you'll find a switch for bots, something like this quick example using phpBB's syntax:

    <!-- IF S_IS_BOT -->
    Author's Name
    <!-- ELSE -->
    <a href="link_to_authors_profile.php">Author's Name</a>
    <!-- ENDIF -->


    One glaring link that is not removed is the link on post titles. Personally I find the titles on posts to be clutter, so I just remove them entirely but keep the code for reference when updating. Starting on line 227 of /styles/prosilver/template/viewtopic_body.html find:

    <h3 {% if postrow.S_FIRST_ROW %}class="first"{% endif %}>
    {% if postrow.POST_ICON_IMG %}
    <img src="{{ T_ICONS_PATH }}{{ postrow.POST_ICON_IMG }}" width="{{ postrow.POST_ICON_IMG_WIDTH }}" height="{{ postrow.POST_ICON_IMG_HEIGHT }}" alt="{{ postrow.POST_ICON_IMG_ALT }}" title="{{ postrow.POST_ICON_IMG_ALT }}">
    {% endif %}
    <a {% if postrow.S_FIRST_UNREAD %}class="first-unread" {% endif %}href="{{ postrow.U_MINI_POST }}">{{ postrow.POST_SUBJECT }}</a>
    </h3>


    Replace with:
    <!-- IF 0 -->Mod - removed post titles
    <h3 {% if postrow.S_FIRST_ROW %}class="first"{% endif %}>
    {% if postrow.POST_ICON_IMG %}
    <img src="{{ T_ICONS_PATH }}{{ postrow.POST_ICON_IMG }}" width="{{ postrow.POST_ICON_IMG_WIDTH }}" height="{{ postrow.POST_ICON_IMG_HEIGHT }}" alt="{{ postrow.POST_ICON_IMG_ALT }}" title="{{ postrow.POST_ICON_IMG_ALT }}">
    {% endif %}
    <a {% if postrow.S_FIRST_UNREAD %}class="first-unread" {% endif %}href="{{ postrow.U_MINI_POST }}">{{ postrow.POST_SUBJECT }}</a>
    </h3>
    <!-- ENDIF -->

    thecoalman

    12:57 pm on Jul 27, 2025 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Forgot to add: if you edit a template file you need to clear phpBB's cache. For development purposes, if you are making a lot of edits to templates, you can set "Recompile stale style components:" to yes under load settings; this should be set to no for production.

    Side note: when you clear phpBB's cache from the ACP, if you have OPcache enabled this also clears it completely, which may or may not be something you want to do. phpBB is fully integrated with OPcache, e.g. it will invalidate its own cache files cached by OPcache.