
Forum Moderators: Robert Charlton & goodroi


URLs with long alphanumeric strings dynamically added - why? Gaming Google?

     
2:46 am on Nov 30, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member play_bach is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 20, 2005
posts:3076
votes: 4


There's a site that has been dominant for its keywords in a niche I follow, for no apparent reason. I notice that, unlike everybody else, they keep changing the URL dynamically by adding a very long alphanumeric string right before the .html part of the URL, like so:
www.example.com/Key-Word-a483ef692f4d689e436973e128c9a24.html

Can anybody shed some light on this technique and why it seems to be getting past Google? Thanks.
3:16 am on Nov 30, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


Interesting. I have seen URLs that have what appears to be a byte array added as part of the URL, but the ones I have seen were not being changed dynamically.

How often does it change? Daily? Weekly? On page refresh (e.g. a redirect to a new version)?

When you say it is dynamically changing, do you mean the URL in the SERPs, the internal links, or both?

Have you tried to see what happens with the previous incarnation when the URL is dynamically changed? Does the old URL redirect to a new version?
3:31 am on Nov 30, 2014 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


It's 32 characters, all a-z or 0-9. Probably an md5 checksum, which fits the same criteria.

I suspect it's just a way to ensure they are getting unique urls. For example, perhaps they are getting the md5 checksum of the current date plus the keyword. That would generate a unique url that wouldn't be repeated.
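
A minimal sketch of that guess in Python, assuming the suffix really is an MD5 of the date plus the keyword. The function name, the ISO date format, and the URL layout are assumptions for illustration; nothing here is confirmed about the site in question.

```python
import hashlib
from datetime import date


def dated_url(keyword: str, day: date) -> str:
    """Build a hypothetical URL whose suffix is md5(date + keyword)."""
    # Assumption: the hash input is the ISO date followed by the keyword.
    hash_input = f"{day.isoformat()}{keyword}".encode("utf-8")
    digest = hashlib.md5(hash_input).hexdigest()  # always 32 hex characters
    return f"www.example.com/{keyword}-{digest}.html"


# Same keyword on two consecutive days gives two different suffixes.
print(dated_url("Key-Word", date(2014, 11, 29)))
print(dated_url("Key-Word", date(2014, 11, 30)))
```

The same date and keyword always produce the same URL, while a new date produces a new one, which would match a URL that changes daily.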
3:37 am on Nov 30, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15507
votes: 752


perhaps they are getting the md5 checksum of the current date plus the keyword

But then what happens if you use yesterday's URL? Or, more to the point: G### can't possibly not notice that every URL they've ever crawled now leads to a 301. If nothing else, the site would be flooded with requests for
/djkfl34j5iocjv.html
(the cat-on-the-keyboard URLs that google asks for when it suspects a site of returning "soft 404s").

How long has it been going on?
4:03 am on Nov 30, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member play_bach is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 20, 2005
posts:3076
votes: 4


Good questions and, honestly folks, a bit over my head. I can say it has been going on for a couple of years at least, maybe more. At first, I just dismissed it as some weird keyword-stuffing mod of the URL, but then they shot right to the top of the SERPs, so maybe they know something I don't?

On top of that, the page I follow is bloated with tons of javascript and then minified with some 47,000+ lines of code. The webmaster is based in Poland and I checked out his personal site where it's clear from his resume (even though I can't read Polish!) that he's a skilled MySQL, PHP coder so I don't think what he's doing on the site in question is an accident.

I just find it odd that long keyword+number URL constructs like this can apparently bypass Google's filters when on the other hand, squeaky clean URLs are getting hit.

Over on Bing, it's a different story as they don't even show up for the keyword until the bottom of page two. Also, on Bing they cache the query string without the numbers, like so:
www.example.com/index.php?action=abc&defg=hijk

[edited by: Play_Bach at 4:22 am (utc) on Nov 30, 2014]

4:21 am on Nov 30, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


I am still not clear what "dynamically changing" means. Do they change all their internal links to a new URL with a different number, and if so, how often? Daily, weekly, within a day?
4:29 am on Nov 30, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member play_bach is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 20, 2005
posts:3076
votes: 4


@aakk9999 - sorry, I was editing when you replied and wasn't able to respond.

I just posted the URL format above that Bing caches:
www.example.com/index.php?action=abc&defg=hijk .

It's definitely dynamic, though I'm not sure how often it changes. When I click on Google's link, the URL doesn't change. But then on Bing, the URL doesn't have the numbers or .html component. Also, the URL doesn't go to the keyword page if entered directly into the address bar. It only works as a result of a query for the keyword.
2:54 pm on Nov 30, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member redbar is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Oct 14, 2013
posts:3141
votes: 453


Until earlier this year, and for at least 2-3 years, I had seen the same thing happening with hundreds of Chinese domains, all owned by the same company and each one specifically targeting one keyword phrase. Google was ranking them extremely well; they would have as many as 10-12 domains in the top 20.

I haven't seen them for a while, so I assume either Google's sussed it out or they are just not bothering since it's simply not worthwhile any more...I suspect the latter.
5:20 pm on Nov 30, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 13, 2003
posts: 1053
votes: 3


A 32 character string? The string in your first post only includes 0-9 and a-f, g-z are missing. That makes it look like a hexadecimal number to me.

Do all the pages that have this only have 0-9 and a-f?

Then, why would someone need a 128-bit number in a URL?
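
On the 128-bit figure: each hexadecimal digit encodes 4 bits, so 32 of them carry 32 x 4 = 128 bits. A small Python sketch of that arithmetic, plus a check for "exactly 32 characters, all 0-9 or a-f". The helper name and the sample strings are made up for illustration.

```python
import re

# Exactly 32 lowercase hex digits, nothing before or after.
HEX32 = re.compile(r"[0-9a-f]{32}")


def looks_like_128_bit_hex(token: str) -> bool:
    """Return True if token is exactly 32 characters drawn from 0-9 and a-f."""
    return HEX32.fullmatch(token) is not None


bits_per_hex_digit = 4
print(32 * bits_per_hex_digit)  # 128

print(looks_like_128_bit_hex("0123456789abcdef0123456789abcdef"))  # True
print(looks_like_128_bit_hex("djkfl34j5iocjv"))  # False: too short, and j, k, l are not hex
```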
11:23 pm on Nov 30, 2014 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


g-z are missing


Consistent with using an MD5 hash (I should have said a-f originally). There's not enough info in the original post to understand how these urls are being used, but I'm pretty sure the hex string is an MD5 hash. It's an easy way to make sure urls you are generating are unique, consistent, and don't look obviously like they are date related.
11:33 pm on Nov 30, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 13, 2003
posts: 1053
votes: 3


Consistent with using an MD5 hash


Oh, I never knew that. You learn something every day!

So why is Google pushing this content up? Freshness? Are they not checking against previous pages and seeing a duplicate page?
11:55 pm on Nov 30, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15507
votes: 752


What I'm curious about is how the site avoids being flooded with Googlebot requests for every URL it has ever seen, which by this point must outnumber active URLs by about 1000 to one ("a couple of years", assuming the URL really changes every day).
12:11 am on Dec 1, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


@Play_Bach,

It would be interesting if you could catch two URLs for the same keyword and see what happens when you request each of them: does the old one redirect? E.g. if you are aware that both of these exist, it would be interesting to request both:

www.example.com/uniqueKeyWord1-someHexStringhere.html
www.example.com/uniqueKeyWord1-differentHexStringhere.html

It would also be interesting to search for a unique paragraph on one of these pages in combination with "site:example.com" to see if there are "old" URLs with a different hex string in Google index.

Does the page have canonical? If so, what does it say?

But then on Bing, the URL doesn't have the numbers or .html component

Is there a difference in robots.txt with regards to googlebot and bingbot?
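
One way to check the robots.txt question offline is Python's standard urllib.robotparser. This is only a sketch: the robots.txt content below is invented to show what a per-crawler difference would look like, and the helper name is hypothetical. It says nothing about what the real site serves.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_agent(robots_txt: str, url: str, agents=("Googlebot", "bingbot")) -> dict:
    """Report, per user agent, whether robots_txt allows fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Made-up example: bingbot is kept away from the hashed URLs, everyone else is not.
SAMPLE_ROBOTS = """\
User-agent: bingbot
Disallow: /Key-Word-

User-agent: *
Disallow:
"""

result = allowed_by_agent(SAMPLE_ROBOTS, "http://www.example.com/Key-Word-0123.html")
print(result)
```

To test a live site, paste its actual robots.txt text in place of SAMPLE_ROBOTS and compare the two answers.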
1:27 am on Dec 1, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15507
votes: 752


or .html component

?! Is each page accessible both ways, with and without .html? Try a few and see.

It's sooo tempting to interpret this as meaning bing is, in this respect, more intelligent than google ;) But what I said above about Googlebot requests would go double for bing, because they never stop asking for old URLs.
1:45 am on Dec 1, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Oct 5, 2012
posts:921
votes: 181


We use this type of url to avoid duplicate urls/files/folders.

Could also be used for tracking, referral links, affiliates, and possibly gaming google's freshness algo.
9:36 am on Dec 1, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member play_bach is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 20, 2005
posts:3076
votes: 4


It would also be interesting to search for a unique paragraph on one of these pages in combination with "site:example.com" to see if there are "old" URLs with a different hex string in Google index.


@aakk9999 - OK, just did that. Google returned some 2,500 pages all with the same meta descriptions. Many of the pages have the alphanumeric strings in the URLs, many don't, so it appears to be selective about which URLs the site chooses to encode.

Adding the MD5 type string to game freshness I suspect is what's going on here. Beats me how they're getting away with it though.
2:14 pm on Dec 2, 2014 (gmt 0)

Junior Member

joined:July 29, 2014
posts:47
votes: 0


Perhaps they are hashing the query parameters (action=abc&defg=hijk => a483ef692f4d689e436973e128c9a24). This way Google cannot tell which parameters it should ignore and has to look at the content of the page. This will probably generate a lot of duplicate content, but Google will bubble up the page it considers to better answer a query. The webmaster may then infer why a particular version of the page is ranking better and plan SEO accordingly.
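
A rough Python sketch of that idea, assuming MD5 over the raw query string. The function name, the lookup table, and the path layout are hypothetical; the thread does not establish how the site actually builds its URLs.

```python
import hashlib

# Hypothetical server-side table. MD5 is one-way, so the site would need
# some mapping like this to get from a hashed path back to its parameters.
lookup = {}


def hashed_path(keyword: str, query: str) -> str:
    """Return a path whose suffix is the MD5 hex digest of the query string."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    path = f"/{keyword}-{digest}.html"
    lookup[path] = query  # remember which parameters this path stands for
    return path


# The same two parameters in a different order hash to a different path,
# so nothing in the URL reveals that the pages are equivalent.
first = hashed_path("Key-Word", "action=abc&defg=hijk")
second = hashed_path("Key-Word", "defg=hijk&action=abc")
print(first)
print(second)
```

Because the digest hides the parameter names, a crawler cannot learn to ignore any one of them from the URL alone, which is the effect described above.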
3:05 pm on Dec 2, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


It is possible that it is done for the reason trabis says above.

@Shepherd
We use this type of url to avoid duplicate urls/files/folders

I am confused. Are you saying you have a different URL but the page shows the same content?

Or are you using something like this just to make sure two DIFFERENT pages do not attempt to use the same URL? Kind of like an ID in URL?
3:38 pm on Dec 2, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Oct 5, 2012
posts:921
votes: 181


@aakk9999

We have several writers creating our own content and also UGC. We use article titles as page urls. So, when our system creates the page it adds a unix time stamp to the end of the url to avoid possible duplicate urls.

/green-widgets-in-oklahoma-1413241632.html
/green-widgets-in-oklahoma-1409852741.html

The pages would contain different articles or content, they just happen to have the same title.

Doesn't happen very often, but the first time it did it was a big mess, this was a simple and quick fix.
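
The scheme described above can be sketched in a few lines of Python. The slugify rule and function names are assumptions for illustration, not the poster's actual code.

```python
import re
import time


def slugify(title: str) -> str:
    """Lowercase the title and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def article_url(title: str, created_at=None) -> str:
    """Append a Unix timestamp so two articles with the same title get different URLs."""
    timestamp = int(time.time()) if created_at is None else int(created_at)
    return f"/{slugify(title)}-{timestamp}.html"


print(article_url("Green Widgets in Oklahoma", 1413241632))
print(article_url("Green Widgets in Oklahoma", 1409852741))
```

Note that two same-titled articles created within the same second would still collide, so a database ID or a counter is a safer suffix if that can happen.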
3:58 pm on Dec 2, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


Thanks for clarifying - so it is like ID in URL.

I don't think this is the same as what Play_Bach reports. From what I understood, the page content is the same over there but the URLs change. The reason may be what trabis described.
6:02 pm on Dec 2, 2014 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12237
votes: 364


Does the page have canonical? If so, what does it say?

This point hasn't been addressed in responses, but is likely an important clue to what's going on.

Current urls in the Washington Post, eg, appear to be adding hex identifier strings on articles... but from what I've seen there's only one version of each page, and the canonical tag carries a string that's consistently the same as the string in the url. The identifiers are 36 characters. I can imagine that somewhere in these strings, 4 of the characters might be used to classify page type or whatever.
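
For anyone wanting to run the canonical check, here is a minimal Python sketch using only the standard library. The sample HTML and URL are invented; in practice you would feed in the source of the page you fetched and compare the result with the URL you requested.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens and any letter case.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def canonical_of(html: str):
    """Return the canonical URL declared in html, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Made-up page: the canonical carries the same hex string as the requested URL.
REQUESTED = "http://www.example.com/Key-Word-0123456789abcdef0123456789abcdef.html"
PAGE = f'<html><head><link rel="canonical" href="{REQUESTED}"></head><body></body></html>'

print(canonical_of(PAGE) == REQUESTED)
```

If the canonical points at a URL without the hex string, or at a different hex string, that would explain why one engine shows one form of the URL and another engine shows a different one.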
11:37 pm on Dec 2, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


I would guess it's about tracking and any "gaming" that might be going on is an unintentional side benefit, but that is just a guess of course.

Have you tried to access a page from Google with an empty cache and your JavaScript turned off to see if you still get redirected?
11:43 pm on Dec 2, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4982
votes: 42


It could easily be someone's idea of "avoiding session parameter GET variables that confuse search engines" and mistakenly thinking putting it in the path makes it better. It's not a problem for today's search engines, but I've seen people draw unusual conclusions like that.
1:09 am on Dec 3, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member play_bach is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 20, 2005
posts:3076
votes: 4


Have you tried to access a page from Google with an empty cache and your JavaScript turned off to see if you still get redirected?


@TheMadScientist - Redirecting isn't what's happening here. The link on Google will go to the page even with all the numbers in it. On Yahoo, the same link doesn't have the numbers and goes to the same page. On Google, the page shows as .html and on Bing/Yahoo as a php query.
1:26 am on Dec 3, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


Hmmm, interesting...

Also, the URL doesn't go to the keyword page if entered directly into the address bar. It only works as a result of a query for the keyword.

The preceding sounds like a referrer-based, likely JS or meta-refresh, redirect to me, but I could be misunderstanding or reading something incorrectly.
1:45 am on Dec 3, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


The link on Google will go to the page even with all the numbers in it. On Yahoo, the same link doesn't have the numbers and goes to the same page.

Yahoo's results are provided by Bing. Bing, in an effort to avoid dealing with duplicate content, will not index the same content twice, so I'm not surprised there's a difference between Google and Bing/Yahoo. Google indexes anything and everything it can, then guesses which version of a duplicate people want to see. Bing indexes unique pages, and when it finds a duplicate or near-duplicate of a page already in its index, it won't index the secondary URL.