Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

Google Following URLs Without Hyperlinks

 3:27 pm on Nov 21, 2011 (gmt 0)

I came across some 404 pages on my site in Google WMT. These were pages that I know I definitely do not have on my site. Anyway, I investigated and looked at the linking pages: they were external websites that reference my URLs without actually linking using an <a href> tag.

It seems Googlebot has tried to follow these URLs and is reporting them as broken links.

Has anyone else come across anything like this?



 5:22 pm on Nov 21, 2011 (gmt 0)

Yes. This has been the case for at least 6 months now.


 5:25 pm on Nov 21, 2011 (gmt 0)

Yes, often. There are some recent threads about it. (Anyone got a good memory for Forums search terms?) It's infuriating but there's generally nothing you can do about it. You can fix your own links-- but when it isn't yours and it isn't even a link, about all you can do is swear at g### ;)


 5:28 pm on Nov 21, 2011 (gmt 0)

I've been seeing this a lot lately, but more so with truncations of actual URLs on my site, with ... appended to the end.

The linking pages that GWT lists as the source of these malformed links are scraper and mashup sites that use Bing's search results. These pages include an actual hyperlink to my site, as well as the truncated url...

I have also been noticing more 404 traffic coming to strange URLs - URLs that are very similar to actual URLs, but with slight changes. I haven't seen any of these show up in GWT, though.


 6:27 pm on Nov 21, 2011 (gmt 0)

If it is a truncation, or has some trailing punctuation that is causing a 404, I generally put in a 301 redirect. If Googlebot can't figure it out, then many users that copy and paste probably can't figure it out either. Another good reason to have short URLs: fewer ways to mess them up.
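For the trailing-punctuation case, a minimal .htaccess sketch might look like the following. This assumes Apache with mod_rewrite enabled; the .htm extension and the punctuation set are illustrative, not a recommendation for every site (a genuine truncation removes characters and can't be repaired this way):

```apache
# Sketch only: strip common trailing punctuation that gets glued onto
# pasted URLs (e.g. /widgets/page.htm. or /widgets/page.htm,) and
# 301 to the clean URL. Assumes Apache + mod_rewrite.
RewriteEngine On
RewriteRule ^(.+\.html?)[.,;:!]+$ /$1 [R=301,L]
```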


 6:34 pm on Nov 21, 2011 (gmt 0)

Interesting, I wasn't aware that this has been the case for a few months now; I've only just started seeing them.
These sites do seem like scrapers that use span tags with the CSS class 'msnresulturl' on truncated URLs of my pages.
How annoying.


 6:35 pm on Nov 21, 2011 (gmt 0)

So does Google follow all URLs, or just truncated ones?


 7:34 pm on Nov 21, 2011 (gmt 0)

I've had this issue since summer (at least). In my case I put the link URL in the YouTube video description: the URL, then a blank line, then text. YouTube video scrapers put the video on their website, and Googlebot concatenates the text below the URL onto the end of it... So now I put an extra space after the URL before leaving the blank line below... Hope this will help.


 9:44 pm on Nov 21, 2011 (gmt 0)

It appears that Googlebot will follow anything that looks like a URL, whether plain text or an actual link. In our case we have "..." at the end, anchor text mixed in with the link, HTML code included, and a bad case of a trailing / after filename.htm. Sometimes these actually resolve - which then creates a load of file-not-founds, as the relative paths are incorrect.

I would have thought they would have discounted these at once as they come from off-site, but no, the bot aimlessly follows them.


 1:20 am on Nov 22, 2011 (gmt 0)

Although annoying, it is not a problem if such URLs return 404.

However, if such a partial URL returns 200 OK (and you do not have a canonical link element implemented), then you might have a duplicate content problem on your hands that can quickly escalate, so watch the Duplicate Titles/Descriptions section of WMT for it.
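For reference, the canonical link element mentioned here is a single tag in the page's <head>; the URL below is a placeholder, not a real address:

```html
<!-- Placed in the <head>; the example.com URL is a placeholder -->
<link rel="canonical" href="http://www.example.com/widgets/page.htm">
```

With this in place, even if a mangled variant of the URL returns 200 OK, the duplicate should consolidate onto the canonical URL.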


 12:54 pm on Nov 22, 2011 (gmt 0)

I can see a bit of a problem here. If you have a line in your htaccess that redirects hits on the non-www version of your site to the www version, and then some scraper site links to a non-existent page without the www, Googlebot will get a 301 response followed by a 404. This then looks like it's a problem with your site, not the external link.

Would people say it would be fine to just send any non-www hits to the home page with the www? If, say, you knew that virtually no-one is linking to anything other than your home page with valid non-www links anyway?
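For context, the usual host-canonicalization rule looks something like this (example.com is a placeholder, and this assumes Apache with mod_rewrite). It 301s any non-www request to the same path on www, which is why a scraper's bogus non-www URL shows up as a 301 followed by a 404:

```apache
# Typical .htaccess host canonicalization (example.com is a placeholder).
# A request for http://example.com/bogus-page.htm gets a 301 to
# http://www.example.com/bogus-page.htm, which then returns the 404.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```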


 1:14 pm on Nov 22, 2011 (gmt 0)

A 301 redirect of a page that has never actually existed to a 404 page isn't a problem. Google will tolerate the behavior without thinking your site has problems. You should still further redirect such requests when you find them.


 7:08 pm on Nov 22, 2011 (gmt 0)

I'm seeing them try to follow shortened links
like www.example.com/directory/long....page.htm
where the dots are presumably how some forum shortens it, but the href tag is unchanged. The link in the forum works, but if they spider the text instead of the actual href attribute it's obviously broken.


 9:01 am on Nov 23, 2011 (gmt 0)

John Mueller says at the Google Help Forum [google.com] that they are using non-hyper-texted URLs to find new content and it sometimes causes 404s in WMT.


 11:30 pm on Nov 23, 2011 (gmt 0)

Doesn't sound good. Potential for sabotage here?

Robert Charlton

 6:29 am on Jan 2, 2012 (gmt 0)

suzukik - Thanks for the link to the Google Help Forum thread... [google.com...]

To keep this discussion self-contained, it's worth quoting some of John Mueller's comments from the thread here....

My emphasis below...
Those links appear to come from text on those pages. We've started picking up text that looks like URLs on HTML pages and seeing if they lead to new content. Sometimes those links are truncated and useless, but it's easy enough to try them and forget them if they lead nowhere, so we've started checking them to be sure. We primarily use these kinds of URLs for discovering new content.

It's similar to how we pick up links in PDFs, text-files, and JavaScript; we just want to make sure that we have as many sources of URLs covered as possible :-).

I realize that this can lead to a somewhat cluttered crawl errors section in Webmaster Tools, so we're looking into ways of making that a bit clearer....

John also points out that 404s for URLs that don't exist are normal and should present no problems. Regarding the 301s, he says...

At any rate, you don't need to "fix" this problem (eg with a 301 redirect), if you're sure that the URL should really not exist.


 7:36 am on Jan 2, 2012 (gmt 0)

We tag outlinks on pages in Google Analytics using _trackPageview; Google follows those even without hyperlinks, which is pretty stupid.


 5:25 am on Apr 10, 2012 (gmt 0)

Today in GWT I found two really peculiar 404s: they were shown as "Linked From" an existing page where nothing even close to that link structure occurs. The links were shown as relative links with a non-existent directory appended, then eight or nine numbers plus .php - a format I don't use anywhere. The only thing I could find after some searching through help was that such 404s might be caused by the bot's attempts to decipher JavaScript. That doesn't make sense to me: there are only two very simple scripts on the "Linked From" page, one framebreaker and a pop-up script used for larger images. The only .php files on the site are a few menu and footer includes, and the number strings in those '123456789.php' filenames don't exist on that page.

Because these two 404s just showed up in the last week, I guess I will need to check into GWT more often. The rest of the 404s shown are malformed links on scraper search/AdSense sites. It is nice that they are seeking out new content, but this is just manufacturing URLs from thin air.


 6:19 am on Apr 10, 2012 (gmt 0)

Doesn't really boost their case for "Please let us crawl your .js files" does it :(

So far, the worst I've got is "linked from" ... a page that itself no longer exists. (Darn! If I'd paid closer attention before clicking "Fixed", I would have noticed if they ever had two long-gone pages linking only to each other.) And Google's perennial favorite: attributing nonexistent pages to a sitemap from 2008, even though it's "no data available" when you press for details.

I'm getting a lot right now because I de-blocked a bunch of pages I deleted a year ago, so they're picking up 410s right and left. It happened to coincide with a change in their way of recording "errors". My goodness, that Googlebot has a long memory.

Still think it would be nice if they told me the names of those 78 pages they profess to have been roboted-away from. Then at least I could be sure they were those same long-deleted pages, and not other pages that google isn't even supposed to know about. (I found three back when they were naming names. They're only linked from one place in the world-- and that place is also securely roboted-out, so how does g### find out about them? Don't answer that. It's going to be one of those search-engine technicalities that I can't wrap my brain around, isn't it.)


 9:32 am on Apr 16, 2012 (gmt 0)

A 301 redirect of a page that has never actually existed to a 404 page isn't a problem

If the bot sees it, it may create a soft error in GWT. Do a valid redirect to a valid page, or return a 4xx right away.
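One way to answer "right away" in .htaccess, assuming Apache with mod_rewrite and a purely illustrative pattern, is to match the known-bad URL shapes and return a 410 Gone with no redirect hop:

```apache
# Sketch only: answer fabricated URLs with 410 Gone immediately,
# instead of chaining a 301 into a 404. The pattern matches the
# truncated "..." URLs seen in scraped snippets; adapt to your own case.
RewriteEngine On
RewriteRule \.\.\.$ - [G,L]
```

The [G] flag sends 410 Gone, which tells the bot the URL is intentionally dead rather than temporarily broken.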

they are using non-hyper-texted URLs to find new content and it sometimes causes 404s in WMT.

I replaced the 404s with redirects years ago, so I don't get any errors even if the bot finds something that looks like a link. Looks like infinite 404 error space may hurt after all.
