| 9:30 pm on Jul 29, 2006 (gmt 0)|
Did you build your site from scratch, i.e. in a text editor, or did you use commercial templates or something like Dreamweaver or FrontPage?
Or are you running an ecommerce package?
| 10:04 pm on Jul 29, 2006 (gmt 0)|
It might be wise to look up what punctuation marks these codes represent in that URL:
I would also be interested in what this part is for:
| 12:46 am on Jul 30, 2006 (gmt 0)|
YOUR site does not use those characters, but the site where that link came from DOES. That's why Google was looking for that URL. It's not because of your site: some other site, like a scammer's phony scraper SERP site, has a link to your site. When Google crawled them and found your URL there, it came to your site to index that URL and was met with an error caused by the improperly formatted URL.
It might also be the 404 URL that Google generates to verify your site when you first sign up to Sitemaps. A lot of people forget that, as part of the verification process, Google sends a purposely incorrect URL to your site to make sure you have a true 404 Page Not Found page. So it could be that also.
As we have found with our site, our links appear on thousands of those phony SERP scraper pages, which merely repackage the top 10 Overture or top 10 Google AdWords pay-per-click results for a particular keyword.
So the question is, do you advertise on AdWords or Yahoo's pay-per-click search?
If so, what you're seeing is probably remnants of a scammer's AdWords or Overture affiliate code used to generate the URL leading to your site from his phony SERP page. It's easy to spot the Overture and Google links to your web site on these phony SERP pages. Here's how:
Let's say you search for widgets on Google, and click on one of the results that looks promising to you. Turns out you just land on one of those SERP pages that is just another listing of 20 search engine results, with some AdWords ads thrown in. Then you see your site listed there.
You'll see the link to your site, and when you mouse over it, the browser status bar shows you www.example.com. But if you actually do a view source on the scammer's web page, you'll see the URL is some really long URL like the one you showed in the post that started this thread.
Usually though, it will be like 2 lines long, and start like this:
Scammers are sneaky this way: they tell the browser to show you one URL to trick you into thinking you are going straight to your site, but in the HTML they actually use the lengthy Google AdWords or Overture PAID link to your site. So people click on the "link to your site" not knowing it's actually a sponsored link that you are paying for. The affiliate links from Google AdWords and Overture are long, with tons of encoded characters.
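One way to check a suspect page for this trick is to compare each link's visible text with the host in its href. Here is a minimal Python sketch, assuming you have saved the page's HTML yourself. The class name, function name, and the ads.tracker.invalid host are made up for illustration, and a page that sets the status bar text with JavaScript would need extra handling:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while we are inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def mismatched_links(html):
    """Return links whose real host does not appear in the text shown to the user."""
    collector = LinkCollector()
    collector.feed(html)
    suspects = []
    for href, text in collector.links:
        host = urlsplit(href).netloc.lower()
        if host and text and host not in text.lower():
            suspects.append((href, text))
    return suspects


# Hypothetical page: the first link shows one site but points somewhere else.
page = (
    '<a href="http://ads.tracker.invalid/click?u=x">www.example.com</a> '
    '<a href="http://www.example.com/">www.example.com</a>'
)
print(mismatched_links(page))
```

Only the first link is reported, because its href host is not the host shown in the link text.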
So why did you get the 500 error?
It could be they are trying to wreck your server on purpose, or they just screwed up generating the code for your site, and your server did not like the format, and choked on it.
The fact that you got the 500 error and NOT the 404 error makes me think that this was NOT Google performing the 404 test as part of verification. But check your log file and see if this error occurred at the time you were performing your initial Sitemaps verification.
My gut feeling says this was a scammer pulling some kind of trick that backfired on him. He won't earn any money as an AdWords affiliate if the link does not go through!
Hope this helps!
[edited by: JeffOstroff at 1:11 am (utc) on July 30, 2006]
| 7:29 am on Jul 30, 2006 (gmt 0)|
The thing is that we've been hooked up with Sitemaps since November of last year, so our site was verified a long time ago. It threw the 500 error because we use ASP.NET and it searches for the query string, which normally would come after a question mark. Because some of the elements in what should have been a query string were illegal characters in that language, like the single quote, it threw an error. It's part of our protection against SQL injection etc.
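To illustrate why a missing question mark matters, here is a rough Python sketch (not our ASP.NET code) using the standard library's urlsplit. The URLs, the function name, and the set of "suspicious" characters are assumptions made up for the example:

```python
from urllib.parse import urlsplit, unquote

# Characters flagged here are an assumption for illustration, not a complete list.
SUSPICIOUS = set("'\"<>;")


def inspect_url(url):
    """Show where the parameters ended up and which risky characters they contain."""
    parts = urlsplit(url)
    decoded = unquote(parts.path) + unquote(parts.query)
    return {
        "path": parts.path,
        "query": parts.query,
        # True when parameters were glued onto the path with & and no ? was present.
        "params_in_path": "&" in parts.path and not parts.query,
        "suspicious_chars": sorted(SUSPICIOUS & set(decoded)),
    }


# Hypothetical malformed link: no "?", so everything after the page name is path.
bad = inspect_url("http://www.example.com/item.aspx&y=1&q=%27x")
# Hypothetical well-formed link for comparison.
good = inspect_url("http://www.example.com/item.aspx?y=1")
print(bad)
print(good)
```

In the malformed case the query comes back empty and the whole `&y=1&q=%27x` tail is treated as part of the path, which is the kind of request a strict handler may reject.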
Also, Google Sitemaps now has a way to verify your site through a meta tag, which is nice since most sites don't like throwing potential customers to a 404. I personally like to send them to a page that allows them to easily search again.
I wouldn't say it choked, but it definitely burped. We have had a lot of problems with blasted scrapers. Nearly every weekend I spend half a day searching for them and reporting them.
We use completely custom code for our site. No templates, no WYSIWYG editors. Totally from scratch.
We do use a few AdWords ads and some Overture since things have been slow for us this summer, but why would Google try to spider its own PPC link or an Overture/Yahoo/whoever link to begin with?
I wish there was some way that, when a sponsored link gets clicked, the link fired back a response to Google, and Google then scanned the page it came from for quality and decided whether or not it's a worthy source. If it isn't, then parse the URL and send the customer to the site at no cost to us, and if the source comes back as questionable, then put it on some sort of probation or place it in a queue for manual review.
| 8:21 am on Jul 30, 2006 (gmt 0)|
OK .. looked up those characters and here's what I found
%7C = |
%5B = [
%5E = ^
So the bad querystring would really look like &y=02D2B45A1CDC1231&i=357&c=9315&q=02^SSHPM[L7.&&'?~|jm~%
When I ran a Google search for it under
I got ALL kinds of supplemental pages and really spammy ones. I mean big-time spammy ones. I did find that the & symbol is used in some PHP pages instead of the ? to define the beginning of a querystring.
The only valid link on one of the pages found was actually a Wikipedia page in Germany.
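For anyone who wants to decode these escapes without a lookup table, here is a minimal Python sketch using the standard library's unquote. The sample string is made up for illustration and is not the URL from the crawl report:

```python
from urllib.parse import unquote

# Each %XX escape is the hex value of one character.
for code in ("%7C", "%5B", "%5E", "%27"):
    print(code, "=", unquote(code))
# %7C = |   (vertical bar)
# %5B = [
# %5E = ^   (caret)
# %27 = '   (single quote)

# Hypothetical fragment, decoded in one go.
sample = "q=02%5Eabc%5Bdef%7Cghi"
decoded = unquote(sample)
print(decoded)  # q=02^abc[def|ghi
```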
| 4:16 pm on Jul 30, 2006 (gmt 0)|
I had a hunch that unescaping the punctuation might lead to more clues, but didn't have time to run that test myself.
| 7:35 pm on Jul 30, 2006 (gmt 0)|
I'm the novice here,
but the reason I asked my questions was that surely Sitemaps only checks links and URLs that originate from your site?
Are you guys saying that they check links originating from 3rd parties?
| 7:56 pm on Jul 30, 2006 (gmt 0)|
From what I'm seeing, it is even checking links that come into your site.
There were several site crawl errors shown for:
mydomain.com instead of www.mydomain.com, since we have a 301 redirect for those to go to the main domain structure. It showed "domain not found", but obviously Google knew it was ours since it showed in our sitemap. Our sitemaps show the full URL and have since last November.
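For readers unfamiliar with that kind of redirect, here is a rough Python sketch of the decision a server makes. It is an illustration only, not our actual setup; the function name is hypothetical and mydomain.com is the same placeholder used above:

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.mydomain.com"  # placeholder domain


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if the host is already canonical."""
    parts = urlsplit(url)
    if parts.netloc.lower() == CANONICAL_HOST:
        return None
    # Keep scheme, path, and query; only the host changes.
    return urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, parts.fragment)
    )


print(redirect_target("http://mydomain.com/a?b=1"))   # http://www.mydomain.com/a?b=1
print(redirect_target("http://www.mydomain.com/a"))   # None
```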
I just wish that it would tell us where the links were found so if one of them is bad we can have the other website fix it .. or if it's a spammer site we could find it easily.
A month or so ago we had to implement these for fear that
were being seen as duplicate content. As we feared, the https versions were, and we were hit with a penalty even though we were totally innocent of spam.
A note on my post above: the link that appeared to be an internal Wikipedia link was not, I believe, to our site, and I certainly don't speak German.
[edited by: Bewenched at 7:59 pm (utc) on July 30, 2006]
| 11:30 pm on Jul 30, 2006 (gmt 0)|
|surely Sitemaps only checks links and URLs that originate from your site? |
Nope, they absolutely do check links coming from other sites.
I find it very annoying that they don't say where errors originate from (except when they stamp it as 'sitemap' originated - that's wonderful and I'm so glad they do that!) because what chance do you have of correcting a link if you don't know where the error is?
At least onsite or off, please, guys! Actual page would be even better!
(Yes, of course they read these threads!)
| 12:22 am on Aug 1, 2006 (gmt 0)|
Wouldn't your log file show you where the error came from? We use WebTrends, which often shows you top referrers, but the log should have the IP address and referrer for every request to your site.
For example, your Apache server would have an error log and a regular access log. I think it shows up in both, as the error log only shows errors, but the access log shows everything. I typically just run WebTrends on the access log, and if it does not parse out what I am looking for, I open up the error log and access log in a text editor and hunt it down.
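If you'd rather not hunt by hand, here is a minimal Python sketch that pulls the failing requests and their referrers out of an access log. It assumes Apache's "combined" log format; the function name and the sample log line are made up for illustration:

```python
import re

# Apache "combined" format: host ident user [time] "request" status bytes "referer" "agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def error_hits(lines, statuses=("404", "500")):
    """Return (status, request, referer, agent) for each line with a matching status."""
    hits = []
    for line in lines:
        match = COMBINED.match(line)
        if match and match.group("status") in statuses:
            hits.append(
                (
                    match.group("status"),
                    match.group("request"),
                    match.group("referer"),
                    match.group("agent"),
                )
            )
    return hits


# Hypothetical log lines; a referer of "-" means the client sent none.
sample_log = [
    '192.0.2.1 - - [30/Jul/2006:00:46:00 +0000] "GET /page&y=1 HTTP/1.1" 500 512 "-" "Googlebot/2.1"',
    '192.0.2.2 - - [30/Jul/2006:00:47:00 +0000] "GET / HTTP/1.1" 200 1024 "http://www.example.com/" "Mozilla/4.0"',
]
for hit in error_hits(sample_log):
    print(hit)
```

With a real file you would pass the open file object instead of the sample list, e.g. `error_hits(open("access.log"))`, where the file name is whatever your server is configured to write.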
[edited by: JeffOstroff at 12:24 am (utc) on Aug. 1, 2006]
| 12:49 am on Aug 1, 2006 (gmt 0)|
OK .. looked up those characters and here's what I found
%7C = |
%5B = [
%5E = ^
Those characters are all used in HTML, and most of the time it happens from a cut-and-paste operation.
More than likely an href tag not being properly closed!
Have you run Xenu's Link Sleuth on your site? Xenu will help you pinpoint the error if it's on your site.
| 1:04 am on Aug 1, 2006 (gmt 0)|
Keep in mind, it might not be your site though; it could be an error in the HTML from a site that links to you.
| 2:05 am on Aug 1, 2006 (gmt 0)|
|Wouldn't your log file show you where the error came from? |
Not if the other page is obscure and the link has never been followed by a visitor - and Googlebot doesn't give referers.
As well, my sites are all generated, so if it should happen to be some bad code internally causing a dud link (no, no, of course that never happens, I am talking theoretically ;)) then a source code search probably won't find it. So the backlink is crucial.