URLs with long alphanumeric strings dynamically added - why? Gaming Google? - Google Search and SEO forum at WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

Play_Bach

2:46 am on Nov 30, 2014 (gmt 0)

There's a site that has been dominant for its keywords in a niche I follow, for no apparent reason. I notice that, unlike everybody else, they keep changing the URL dynamically by adding a very long alphanumeric string right before the .html part of the URL. Like so:
www.example.com/Key-Word-a483ef692f4d689e436973e128c9a24.html

Can anybody shed some light on this technique and why it seems to be getting past Google? Thanks.

aakk9999

3:16 am on Nov 30, 2014 (gmt 0)

Interesting. I have seen URLs that have what appears to be a byte array added as part of the URL, but the ones I have seen were not being changed dynamically.

How dynamically is it changed? Daily? Weekly? On page refresh (e.g. redirects to a new version)?

When you say it is dynamically changing, do you mean URL in SERPs or internal link or both?

Have you tried to see what happens with the previous incarnation when the URL is dynamically changed? Does the old URL redirect to a new version?

rish3

3:31 am on Nov 30, 2014 (gmt 0)

It's 32 characters, all a-z or 0-9. Probably an md5 checksum, which fits the same criteria.

I suspect it's just a way to ensure they are getting unique urls. For example, perhaps they are getting the md5 checksum of the current date plus the keyword. That would generate a unique url that wouldn't be repeated.
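A minimal sketch of the scheme guessed at above, assuming the suffix really is an MD5 checksum of the date plus the keyword. Nothing in the thread confirms this is what the site does; the function name `dated_url` and the date/keyword separator are hypothetical.

```python
import hashlib
from datetime import date


def dated_url(keyword: str, day: date) -> str:
    """Build a URL whose suffix is the MD5 hex digest of date + keyword.

    Assumed scheme, for illustration only: ISO date, a pipe, then the keyword.
    """
    digest = hashlib.md5(f"{day.isoformat()}|{keyword}".encode("utf-8")).hexdigest()
    return f"www.example.com/{keyword}-{digest}.html"


# Same keyword on two consecutive days yields two different 32-character suffixes.
print(dated_url("Key-Word", date(2014, 11, 29)))
print(dated_url("Key-Word", date(2014, 11, 30)))
```

The same date and keyword always produce the same URL, while a new date produces a new one, which is what would make the URL appear to change over time.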

lucy24

3:37 am on Nov 30, 2014 (gmt 0)

perhaps they are getting the md5 checksum of the current date plus the keyword

But then what happens if you use yesterday's URL? Or, more to the point: G### can't possibly not notice that every URL they've ever crawled now leads to a 301. If nothing else, the site would be flooded with requests for
/djkfl34j5iocjv.html
(the cat-on-the-keyboard URLs that google asks for when it suspects a site of returning "soft 404s").

How long has it been going on?

Play_Bach

4:03 am on Nov 30, 2014 (gmt 0)

Good questions and honestly, folks, a bit over my head. I can say it has been going on for a couple of years at least, maybe more. At first, I just dismissed it as some weird keyword-stuffing mod of the URL, but then they shot right to the top of the SERPs, so maybe they know something I don't?

On top of that, the page I follow is bloated with tons of JavaScript, minified, with some 47,000+ lines of code. The webmaster is based in Poland and I checked out his personal site, where it's clear from his resume (even though I can't read Polish!) that he's a skilled MySQL and PHP coder, so I don't think what he's doing on the site in question is an accident.

I just find it odd that long keyword+number URL constructs like this can apparently bypass Google's filters when on the other hand, squeaky clean URLs are getting hit.

Over on Bing, it's a different story as they don't even show up for the keyword until the bottom of page two. Also, on Bing they cache the query string without the numbers, like so:
www.example.com/index.php?action=abc&defg=hijk

[edited by: Play_Bach at 4:22 am (utc) on Nov 30, 2014]

aakk9999

4:21 am on Nov 30, 2014 (gmt 0)

I am still not clear what "dynamically changing" means. Do they change all their internal links to a new URL with a different number, and if so, how often? Daily, weekly, within a day?

Play_Bach

4:29 am on Nov 30, 2014 (gmt 0)

@aakk9999 - sorry, I was editing when you replied and wasn't able to respond.

I just posted the URL format above that Bing caches:
www.example.com/index.php?action=abc&defg=hijk .

It's definitely dynamic, though I'm not sure how frequently it changes. When I click on Google's link, the URL doesn't change. But then on Bing, the URL doesn't have the numbers or .html component. Also, the URL doesn't go to the keyword page if entered directly into the address bar. It only works as a result of a query for the keyword.

RedBar

2:54 pm on Nov 30, 2014 (gmt 0)

Until earlier this year, and for at least 2-3 years before that, I had seen the same thing happening with hundreds of Chinese domains, all owned by the same company, each one specifically targeting one keyword phrase. Google was ranking them extremely well; they would have as many as 10-12 domains in the top 20.

I haven't seen them for a while, so I assume either Google's sussed it out or they just are not bothering since it's simply not worthwhile any more...I suspect the latter.

PCInk

5:20 pm on Nov 30, 2014 (gmt 0)

A 32 character string? The string in your first post only includes 0-9 and a-f, g-z are missing. That makes it look like a hexadecimal number to me.

Do all the pages that have this only have 0-9 and a-f?

Then, why would someone need a 128-bit number in a URL?
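The hexadecimal question can be checked mechanically. A minimal sketch, assuming only that the suffix is lowercase hex where each character carries 4 bits; the helper names are hypothetical.

```python
import string

# Lowercase hexadecimal alphabet: 0-9 and a-f.
HEX_DIGITS = set(string.hexdigits.lower())


def is_lower_hex(s: str) -> bool:
    """True if s is non-empty and every character is 0-9 or a-f."""
    return len(s) > 0 and all(c in HEX_DIGITS for c in s)


def bit_length_of_hex(s: str) -> int:
    """Each hex character encodes 4 bits."""
    return 4 * len(s)


# The suffix exactly as it appears in the first post.
suffix = "a483ef692f4d689e436973e128c9a24"
print(is_lower_hex(suffix), len(suffix), bit_length_of_hex(suffix))
```

A full 128-bit value needs 32 hex characters. The string as posted counts 31, so a character may have been dropped when the example URL was anonymized.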

rish3

11:23 pm on Nov 30, 2014 (gmt 0)

g-z are missing

Consistent with using an MD5 hash (I should have said a-f originally). There's not enough info in the original post to understand how these urls are being used, but I'm pretty sure the hex string is an MD5 hash. It's an easy way to make sure urls you are generating are unique, consistent, and don't look obviously like they are date related.

PCInk

11:33 pm on Nov 30, 2014 (gmt 0)

Consistent with using an MD5 hash

Oh, I never knew that. You learn something every day!

So why is Google pushing this content up? Freshness? Are they not checking against previous pages and seeing a duplicate page?

lucy24

11:55 pm on Nov 30, 2014 (gmt 0)

What I'm curious about is how the site avoids being flooded with Googlebot requests for every URL it has ever seen, which by this point must outnumber active URLs by about 1000 to one ("a couple of years", assuming the URL really changes every day).

aakk9999

12:11 am on Dec 1, 2014 (gmt 0)

@Play_Bach,

It would be interesting if you could catch two URLs for the same keyword and see what happens when you request each of them - does the old one redirect? E.g. it would be interesting to request both of these, if you are aware that both exist:

www.example.com/uniqueKeyWord1-someHexStringhere.html
www.example.com/uniqueKeyWord1-differentHexStringhere.html

It would also be interesting to search for a unique paragraph on one of these pages in combination with "site:example.com" to see if there are "old" URLs with a different hex string in Google index.

Does the page have canonical? If so, what does it say?
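The canonical check can be scripted once the page source is saved. A minimal sketch using only the standard library; the class and function names are hypothetical, and the HTML you feed it would be whatever the page in question actually returns.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel can hold several space-separated tokens, so split before matching.
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel:
            self.canonical = attr.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in html, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Running `find_canonical` on two URL variants of the same page and comparing the results would show whether both point to a single preferred URL.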

But then on Bing, the URL doesn't have the numbers or .html component

Is there a difference in robots.txt with regards to googlebot and bingbot?
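One way to test the robots.txt question is to evaluate the same URL for each crawler with Python's `urllib.robotparser`. This is only a sketch: the robots.txt content below is entirely made up for illustration, and the helper name `allowed` is hypothetical. The real file would have to be fetched from the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, NOT taken from the site under discussion.
ROBOTS_TXT = """\
User-agent: bingbot
Disallow: /Key-Word-

User-agent: *
Disallow:
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "http://www.example.com/Key-Word-abc123.html"
for bot in ("Googlebot", "bingbot"):
    print(bot, allowed(ROBOTS_TXT, bot, url))
```

If the two crawlers get different answers for the hashed .html URLs, that alone could explain why each engine shows a different URL format.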

lucy24

1:27 am on Dec 1, 2014 (gmt 0)

or .html component

?! Is each page accessible both ways, with and without .html? Try a few and see.

It's sooo tempting to interpret this as meaning bing is, in this respect, more intelligent than google ;) But what I said above about Googlebot requests would go double for bing, because they never stop asking for old URLs.

Shepherd

1:45 am on Dec 1, 2014 (gmt 0)

We use this type of url to avoid duplicate urls/files/folders.

Could also be used for tracking, referral links, affiliates, and possibly gaming google's freshness algo.

Play_Bach

9:36 am on Dec 1, 2014 (gmt 0)

It would also be interesting to search for a unique paragraph on one of these pages in combination with "site:example.com" to see if there are "old" URLs with a different hex string in Google index.

@aakk9999 - OK, just did that. Google returned some 2,500 pages all with the same meta descriptions. Many of the pages have the alphanumeric strings in the URLs, many don't, so it appears to be selective about which URLs the site chooses to encode.

Adding the MD5 type string to game freshness I suspect is what's going on here. Beats me how they're getting away with it though.

trabis

2:14 pm on Dec 2, 2014 (gmt 0)

Perhaps they are hashing query parameters (action=abc&defg=hijk => a483ef692f4d689e436973e128c9a24). This way Google cannot tell which parameters it should ignore and has to look at the content of the page. This will probably generate a lot of duplicate content, but Google will bubble up the page it considers the best answer to a query. The webmaster may then infer why a particular version of the page is ranking better and plan SEO accordingly.
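A minimal sketch of that idea, assuming the query string is normalized and then hashed with MD5. The function name `hashed_path` and the sorting step are assumptions for illustration; the thread does not establish how, or whether, the site does this.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode


def hashed_path(keyword: str, query: str) -> str:
    """Turn a query string into an opaque hex suffix on a keyword path."""
    # Sort the parameters so the same set always hashes to the same value,
    # regardless of the order they arrived in (an assumed design choice).
    normalized = urlencode(sorted(parse_qsl(query)))
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    return f"/{keyword}-{digest}.html"


print(hashed_path("Key-Word", "action=abc&defg=hijk"))
```

Because MD5 cannot be reversed, a server using this approach would need to keep its own lookup from digest back to the original parameters in order to serve the page.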

aakk9999

3:05 pm on Dec 2, 2014 (gmt 0)

It is possible that it is done for the reason trabis says above.

@Shepherd

We use this type of url to avoid duplicate urls/files/folders

I am confused. Are you saying you have a different URL but the page shows the same content?

Or are you using something like this just to make sure two DIFFERENT pages do not attempt to use the same URL? Kind of like an ID in URL?

Shepherd

3:38 pm on Dec 2, 2014 (gmt 0)

@aakk9999

We have several writers creating our own content and also UGC. We use article titles as page urls. So, when our system creates the page it adds a unix time stamp to the end of the url to avoid possible duplicate urls.

/green-widgets-in-oklahoma-1413241632.html
/green-widgets-in-oklahoma-1409852741.html

The pages would contain different articles or content, they just happen to have the same title.

Doesn't happen very often, but the first time it did it was a big mess, this was a simple and quick fix.
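The timestamp approach described here can be sketched in a few lines. This is an illustration under assumptions, not Shepherd's actual system; `slugify` and `article_url` are hypothetical names, and the slug rules are a guess based on the example URLs.

```python
import re
import time
from typing import Optional


def slugify(title: str) -> str:
    """Lowercase the title and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def article_url(title: str, created_at: Optional[int] = None) -> str:
    """Append a Unix timestamp to the slug so identical titles get distinct URLs."""
    if created_at is None:
        created_at = int(time.time())
    return f"/{slugify(title)}-{created_at}.html"


print(article_url("Green Widgets in Oklahoma", 1413241632))
print(article_url("Green Widgets in Oklahoma", 1409852741))
```

Two articles with the same title created in the same second would still collide, so a database ID or a counter would be a stricter guarantee than a timestamp.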

aakk9999

3:58 pm on Dec 2, 2014 (gmt 0)

Thanks for clarifying - so it is like ID in URL.

I don't think this is the same as what Play_Bach reports. From what I understood, the page content is the same over there but the URLs change. The reason may be what trabis described.

Robert Charlton

6:02 pm on Dec 2, 2014 (gmt 0)

Does the page have canonical? If so, what does it say?

This point hasn't been addressed in responses, but is likely an important clue to what's going on.

Current urls in the Washington Post, e.g., appear to be adding hex identifier strings on articles... but from what I've seen there's only one version of each page, and the canonical tag carries a string that's consistently the same as the string in the url. The identifiers are 36 characters. I can imagine that somewhere in these strings, 4 of the characters might be used to classify page type or whatever.

TheMadScientist

11:37 pm on Dec 2, 2014 (gmt 0)

I would guess it's about tracking and any "gaming" that might be going on is an unintentional side benefit, but that is just a guess of course.

Have you tried to access a page from Google with an empty cache and your JavaScript turned off to see if you still get redirected?

brotherhood of LAN

11:43 pm on Dec 2, 2014 (gmt 0)

It could easily be someone's idea of "avoiding session parameter GET variables that confuse search engines" and mistakenly thinking putting it in the path makes it better. It's not a problem for today's search engines, but I've seen people draw unusual conclusions like that.

Play_Bach

1:09 am on Dec 3, 2014 (gmt 0)

Have you tried to access a page from Google with an empty cache and your JavaScript turned off to see if you still get redirected?

@TheMadScientist - Redirecting isn't what's happening here. The link on Google will go to the page even with all the numbers in it. On Yahoo, the same link doesn't have the numbers and goes to the same page. On Google, the page shows as .html and on Bing/Yahoo as a php query.

TheMadScientist

1:26 am on Dec 3, 2014 (gmt 0)

Hmmm, interesting...

Also, the URL doesn't go to the keyword page if entered directly into the address bar. It only works as a result of a query for the keyword.

The preceding sounds like a referrer-based, likely JS or meta-refresh, redirect to me, but I could be misunderstanding or reading something incorrectly.

TheMadScientist

1:45 am on Dec 3, 2014 (gmt 0)

The link on Google will go to the page even with all the numbers in it. On Yahoo, the same link doesn't have the numbers and goes to the same page.

Yahoo's results are provided by Bing -- Bing, in an effort to not have to deal with duplicate content, will not index the same content twice, so I'm not surprised there's a difference between Google and Bing/Yahoo -- Google indexes anything and everything it can, then guesses which version of a duplicate people want to see. Bing indexes unique pages, and when it finds a duplicate or near-duplicate of a page already in its index, it won't index the secondary URL.