I've been looking into this issue for the last few days. I don't like the addition of the "sid" in guest URLs, for one. The sid makes every URL unique, which seems to send the bots crazy re-downloading the same content over and over again. And my main complaint was how archive.org/the Wayback Machine would index the site: with the sid in the URL, that service is almost useless.
To deal with this I have made two changes to the phpBB software. The first change is in session_begin() in session.php (the last four lines are the actual change):
- Requests with a sid which cannot be found in the database are redirected to a static HTML file stating that their session has expired (with a link back to the main forum page)
- Only registered users will get links with sids (otherwise you cannot change to administrator mode).
This change means that, with very little resource load, the bots are redirected to a very neutral page containing only static links.

Code:
// if session id is set
if (!empty($this->session_id))
{
    $sql = 'SELECT u.*, s.*
        FROM ' . SESSIONS_TABLE . ' s, ' . USERS_TABLE . " u
        WHERE s.session_id = '" . $db->sql_escape($this->session_id) . "'
            AND u.user_id = s.session_user_id";
    $result = $db->sql_query($sql);
    $this->data = $db->sql_fetchrow($result);
    $db->sql_freeresult($result);

    // silly bot counterfeit check
    if (!isset($this->data['user_id']))
    {
        redirect("/expired.htm");
    }
As a result, the web server now actually has resources left to serve regular users.
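For illustration, the decision in that snippet can be reduced to a tiny standalone function. This is only a sketch: expired_sid_target(), its return convention, and the $sessions array are hypothetical stand-ins for the real SESSIONS_TABLE lookup inside session_begin().

```php
<?php
// Minimal sketch of the expired-sid check: a sid was presented but no
// matching session row exists, so the client is sent to the static
// page instead of being given a fresh session.
function expired_sid_target($session_id, array $sessions)
{
    if ($session_id !== '' && !isset($sessions[$session_id]))
    {
        return '/expired.htm'; // stale sid: static page, no new session
    }
    return null; // valid (or absent) sid: continue as normal
}

$sessions = array('abc123' => array('user_id' => 54));
echo expired_sid_target('deadbeef', $sessions), "\n"; // /expired.htm
```

Bots replaying old sid URLs all fall into the first branch, so they only ever see the one static page.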
The second change is in append_sid() (the root cause of it all) in functions.php.

Code:
// Append session id and parameters (even if they are empty)
// If parameters are empty, the developer can still append his/her parameters without caring about the delimiter
global $user;
if ($session_id && $user->data['is_registered'])
{
    return $url . (($append_url) ? $url_delim . $append_url . $amp_delim : $url_delim) . $params . ((!$session_id) ? '' : $amp_delim . 'sid=' . $session_id) . $anchor;
}
else
{
    return $url . (($append_url) ? $url_delim . $append_url . $amp_delim : $url_delim) . $params . $anchor;
}
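To make the effect concrete, here is a standalone sketch of that branching with a simplified signature (the real append_sid() takes more parameters and reads $user from globals; append_sid_sketch() is a hypothetical name used only for this demo):

```php
<?php
// Simplified model of the modified append_sid(): only registered
// users get a sid appended; guests and bots get stable URLs.
function append_sid_sketch($url, $params, $session_id, $is_registered)
{
    $url_delim = (strpos($url, '?') === false) ? '?' : '&';
    if ($session_id && $is_registered)
    {
        // Registered users keep the sid, so switching to the ACP still works
        return $url . $url_delim . $params . '&sid=' . $session_id;
    }
    return $url . $url_delim . $params;
}

echo append_sid_sketch('viewtopic.php', 'p=5177', 'abc123', false), "\n";
// viewtopic.php?p=5177
echo append_sid_sketch('viewtopic.php', 'p=5177', 'abc123', true), "\n";
// viewtopic.php?p=5177&sid=abc123
```

Because guest URLs no longer vary by session, every crawler sees the same canonical link for a given topic.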
I was ready to modify phpBB to get rid of the sids, but I see in phpbb/session.php that the sid and cookies are mostly disabled for bot traffic.
See in session_create():
Code:
// Bot user, if they have a SID in the Request URI we need to get rid of it
// otherwise they'll index this page with the SID, duplicate content oh my!
if ($bot && isset($_GET['sid']))
{
    send_status_line(301, 'Moved Permanently');
    redirect(build_url(array('sid')));
}

Code:
 bot_name           | bot_agent           | user_lastvisit
--------------------+---------------------+------------------------
 Meta [Bot]         | meta-externalagent/ | 2025-06-16 19:56:14-05
 PetalBot [Bot]     | PetalBot            | 2025-06-16 19:53:33-05
 Ahrefs [Bot]       | AhrefsBot/          | 2025-06-16 19:48:27-05
 DotBot [Bot]       | DotBot/             | 2025-06-16 19:41:06-05
 DuckDuckGo [Bot]   | DuckDuckBot/        | 2025-06-16 19:13:22-05
 Google [Bot]       | Googlebot           | 2025-06-16 19:01:32-05
 Semrush [Bot]      | SemrushBot/         | 2025-06-16 18:16:16-05
 Amazon [Bot]       | Amazonbot/          | 2025-06-16 17:44:12-05
 Bing [Bot]         | bingbot/            | 2025-06-16 15:26:01-05
 Majestic-12 [Bot]  | MJ12bot/            | 2025-06-16 11:18:25-05
 AdsBot [Google]    | AdsBot-Google       | 2025-06-16 09:16:18-05
 AmazonVideo [Bot]  | Amazonbot-Video/    | 2025-06-16 07:39:11-05
 TikTokSpider [Bot] | TikTokSpider        | 2025-06-16 06:28:08-05
 Bytespider [Bot]   | Bytespider          | 2025-06-16 01:34:29-05
 Archive.org [Bot]  | archive.org_bot     | 2025-06-16 01:11:08-05
 GPTBot [Bot]       | GPTBot/             | 2025-06-15 13:38:56-05
 Awario [Bot]       | AwarioBot/          | 2025-06-14 21:18:21-05
 Seekport [Bot]     | Seekbot/            | 2025-03-06 04:05:55-05
 Baidu [Spider]     | Baiduspider         | 2025-03-06 04:00:31-05

Code:
"GET /viewtopic.php?p=5177&sid=01108583db15c8648ee8cc1316e5e1 HTTP/2.0" 301 130 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
"GET /viewtopic.php?p=4386&sid=a5fb9503309f74ef676a8e35f92763 HTTP/2.0" 301 130 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
"GET /viewtopic.php?p=5177 HTTP/2.0" 200 14111 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
"GET /viewtopic.php?p=4386 HTTP/2.0" 200 8548 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"

But I was still having issues when submitting my site to archive.org via "Save Page Now": it was still saving my pages with all the URLs containing a sid. A little digging showed that the first fetch it makes to my site is with a headless Chrome 130 browser using a regular user agent, so the page it fetches is the guest view. From there the "archive.org_bot" crawlers would come in downloading content and links, but using the sid URLs from that first fetch. A bit more digging showed that if you use their API you can specify a user agent, and that fixed the problem: the initial request is now given the "bot" view without sid URLs. I just made a script and cron'd it to run every couple of months. Annoying, but better than it was.
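For reference, such a cron'd script can be as small as one POST per page. This is only a sketch under stated assumptions: the endpoint and "Authorization: LOW key:secret" scheme follow archive.org's Save Page Now 2 API as I understand it, but the use_user_agent parameter name, the example page URL, and the environment variable names are assumptions/placeholders to verify against the current SPN2 documentation.

```php
<?php
// Hypothetical cron script for archive.org "Save Page Now": request a
// capture of each page while announcing a bot-like user agent, so the
// initial fetch receives the sid-free bot view. Parameter names are
// assumptions; check the current SPN2 docs before relying on them.
function spn_request_body($page)
{
    return http_build_query(array(
        'url'            => $page,
        'use_user_agent' => 'archive.org_bot', // assumed parameter name
    ));
}

// S3-style keys from archive.org; the request is skipped if unset.
$key    = getenv('SPN_ACCESS_KEY');
$secret = getenv('SPN_SECRET_KEY');

foreach (array('https://example.com/forum/index.php') as $page)
{
    if (!$key || !$secret)
    {
        echo "no credentials, would send: ", spn_request_body($page), "\n";
        continue;
    }
    $context = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => "Authorization: LOW $key:$secret\r\n"
                   . "Accept: application/json\r\n"
                   . "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => spn_request_body($page),
    )));
    echo file_get_contents('https://web.archive.org/save', false, $context), "\n";
}
```

Dropped into cron every couple of months, this keeps the archived copy built from the clean, sid-free URLs.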
Statistics: Posted by esaym — Tue Jun 17, 2025 1:29 am