|changing a name|
Two questions really: a "whether" component and a "how" component. Can't do the "whether" without naming names, so let me just ask about the "how".
Background: I've never liked my domain name. It dates back to my games, which now occupy just one of eight directories. Now I've come across a name that I absolutely adore and that is perfect for me. Not for anyone else, I guess, because all the major TLDs are vacant and it didn't cost me anything. It even fits my favicon better than my current name does.
So I'm thinking very seriously about moving. First step obviously is to sit on the question for a week or two to make sure this isn't a case of temporary insanity. (Those who have snooped will understand that this is all about The Principle Of The Thing; there's no money involved.)
Question: What's the smoothest way of changing names? What I'm looking at is leaving one or two directories-- plus some never-indexed personal content-- at the old name and moving the other six or seven to the new one. I can do the mechanics of redirecting. Existing external links can be counted on the fingers of one hand, so I won't even think about those. A few include files will need minor tweaking; a bout of rigorous link-checking should take care of cross-directory internal links.
What have I overlooked?
If all your internal links are root-relative and you are not changing URL structure (other than a different domain name) then you have it mostly covered with regards to URLs.
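In htaccess terms, the basic pattern might look something like the sketch below. All names here are made-up stand-ins (old domain example.com, new domain example.org, and two hypothetical directories that stay behind):

```apache
RewriteEngine On
# Directories staying behind on the old domain: leave them alone.
RewriteRule ^(stay-one|stay-two)/ - [L]
# Everything else: same path, new host, permanent redirect.
RewriteRule (.*) http://example.org/$1 [R=301,L]
```

Because internal links are root-relative and the URL structure is unchanged, one catch-all rule like this covers the whole move.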
Check whether you refer to your old site/domain name anywhere (page <title>, meta description, content) and change it too if required.
If there is an email address anywhere on the site that uses the old domain name, you may want to change it to use the new domain name.
Is the old site in Webmaster Tools? If so, are you going to execute a "site move" in Google Webmaster Tools? If you do, there are a few steps regarding the old site's robots.txt (it must not redirect) and the old sitemap (it must not redirect, and should list the new site's URLs).
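One way to carve those two files out of the blanket redirect (a sketch only; the sitemap filename is an assumption):

```apache
RewriteEngine On
# robots.txt and the old sitemap must keep answering 200 on the
# old domain during the site-move process, so exempt them first.
RewriteRule ^robots\.txt$ - [L]
RewriteRule ^sitemap\.xml$ - [L]
# Everything else redirects to the new domain.
RewriteRule (.*) http://example.org/$1 [R=301,L]
```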
Crawl your domain with a tool before the move and save URLs in a list. Crawl your old domain with a tool after the move, to make sure all URLs redirect. Crawl your new domain after the move, to make sure the old list = new list (apart from directories you are leaving on the old site).
If your old domain has "Google Places" page, there are additional steps and recommendations, to make sure you do not "lose" the places page.
|Check whether you refer to your old site/domain name anywhere (page <title>, meta description, content) and change it too if required. |
TextWrangler's multi-file search turned out to be a huge help. (Spotlight only brings up the front end of html files, not the raw code.) I knew I had a couple of links from /paintings/ to /rats/ but never realized there were seven of them. And two in the opposite direction. And I would never have remembered the lone link from /games/ to /hovercraft/.
For insurance, the new site's htaccess will have redirects in the opposite direction for the non-moved directories, in case something goes wrong and someone shows up in the wrong place. Or, ahem, if search engines start firing off requests at random.
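On the new site's side, that insurance could look like this (directory names are hypothetical stand-ins for the non-moved directories):

```apache
RewriteEngine On
# Anything that should have stayed on the old domain gets bounced back,
# same path, in case a visitor or a robot shows up at the wrong house.
RewriteRule ^(stay-one|stay-two)/(.*) http://example.com/$1/$2 [R=301,L]
```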
I couldn't find a partial-move option in gwt so they'll have to take their 301s and figure it out from there. (I don't think I would gain anything by telling them I'm moving the whole thing, and then turning around and not moving 2/8 of it.)
It's a bit like moving house isn't it :) An opportunity to look around and say "Do I really want to pack that? I haven't used it since 2008, and I'm not even sure what it's for" and "It will cost more to haul this to Yellowknife than it was worth new, so let's toss it".
:: bump ::
#1 First I copied all content to the new domain name, keeping it roboted-out, and ran exhaustive link checking on the 23rd.
#2 Then I visited with family and tried hard not to think about it.
#3 Finally I continued sitting on my hands until the 28th, when I did the following in rapid succession:
--replace host's placeholder front page with real front page
--put in new htaccess with all relevant rewrites and redirects
--change robots.txt to reveal new site
--change old site's htaccess to temporarily serve 503 on non-redirected content, giving me time to update all includes, boilerplate and cross-links
--add new name to wmt and G### profile
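The temporary-503 step might be wired up something like this (a sketch under assumptions: the rules sit below the redirect rules so only not-yet-redirected content is caught, and the error document path is invented):

```apache
ErrorDocument 503 /errors/503.html
RewriteEngine On
# Don't 503 the error document itself, or requests will loop.
RewriteRule ^errors/503\.html$ - [L]
# Everything left over answers 503 until includes and cross-links
# are updated; crawlers treat 503 as "come back later".
RewriteRule (.*) - [R=503,L]
```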
The elapsed time from beginning to end of this list was just long enough for g### to randomly crawl one page and pick up a redirect to the new site. Along with the new page (identical URLs throughout, just new domain name) it got the page's CSS, giving page as referer. It also asked for robots.txt; this may be significant.
The faviconbot also stopped by. Thanks to the "Firefox/6.0" in its new UA string, it got redirected to the "Sorry, but the server thinks you are a robot" page, but this didn't stop it from getting the favicon.
I goofed on a few image redirects because I never realized how many different URL paths I have with the element "rats/". Oops. Ahem. They're fixed now, and I'll ignore any fallout.
The only real mistake was forgetting that one of the moved directories has a custom error document ... which lives in that directory. I neglected to exempt it from redirecting, so now the bingbot knows of this URL's existence. Normally when this happens, I serve up a 410 to any explicit requests for a named error document. Unfortunately, this one is the 410 document. Oh, dear.
Here is what happens when you add a brand-new site to google webmaster tools. I give this in detail in case other people find it interesting and/or useful.
#1 front page and site-verification code with "Google-Site-Verification" UA.
#2 front page with humanoid UA from 72.14 IP
#3 front page with preview UA, giving www.google.com/search as referer; also all images and styles, giving page as referer. I'm pretty sure the sole purpose of this step was to show me a thumbnail of the site's front page for wmt purposes.
#4 two redirected requests for front page. I think this was when I added the with-www version, required so you can express a preference. Then two non-redirected requests, and one each for the verification file. All this with Google-Site-Verification UA. (Yes, again!)
#5 redirected request for front page by faviconbot. I don't know whether this was a with-www request redirected to without, or a without-www request redirected to old-browser page.
#6 redirected request for front page, with google search as referer and humanoid UA. Also request for piwik.js, but no other files.
pause here for eight minutes-- exactly!-- before moving to next phase
#7 redirected requests for robots.txt and front page, followed by request for front page again. No non-redirected request for robots.txt; that's why it may be significant that it had already seen robots.txt about 15-20 minutes earlier. From here on, everything uses the Googlebot UA in alternation with Googlebot-Image, typically at intervals of about 5 seconds.
#8 after pausing for about a minute, the real work started. Crawl all pages linked from front page. Then crawl images (but not css or js) linked from front page and two (of six) directory-index pages.
#9 after further pause of almost half an hour, continue crawling to pick up second-stage links. I suspect that at this point the computer did some preliminary checking and decided that all the images were the same as before so it didn't need to crawl the rest of them.
#11 meanwhile, the referer-less track continued crawling. I think it ended up picking up every page on the site-- including all pdfs. There was also a fresh set of denied requests for midi files (above), plus another batch of midis living in an unrelated directory, and a handful of images linked in <a href> form. It trickled out with some pdfs that it had probably forgotten about.
#12 mixed in with #11 were a handful of requests from Googlebot-Mobile. I'm not going to guess at a pattern here; they may be random.
|Normally when this happens, I serve up a 410 to any explicit requests for a named error document. Unfortunately, this one is the 410 document. Oh, dear. |
wouldn't a conditional check of THE_REQUEST do the job for you there?
|I give this in detail in case other people find it interesting and/or useful. |
i made a bowl of popcorn
I like your new domain name, Lucy. I knew there were a lot, but didn't know it could be that high.
|wouldn't a conditional check of THE_REQUEST do the job for you there? |
No, because an error response preserves the original request. That is, the server doesn't know whether it's handing over the 410 page in response to a direct request for it, or in response to an internal request for the 410 error document. I've had the same trouble in the past with the 403 page: you can't simply deny access to anyone who asks for it, or you get an infinite loop. You have to respond with a different error code.
|I knew there were a lot but didn't know it could be that high. |
Even Franz Boas Nods. If you're talking discrete lexical items, there may be ten or so. Beyond that, you're getting into polysynthetic issues.
Oh, dear. I hope it didn't take you 100 minutes to read it :)
any "explicit requests for a named error document" would show up in THE_REQUEST.
i'm assuming an "explicit request" in this case is an HTTP GET Request by the user agent.
You may be more familiar with the 403 version of this scenario-- the one you see when you lock someone out but forget to poke a hole for the 403 document. The 410 version behaves exactly the same.
User asks for "410.html", using its exact URL
RewriteRule says it ain't there no more
Server goes to fetch 410 page to tell them so
Internal request for 410 page leads to fresh cycle through htaccess file with fresh 410 status leading to fresh request for 410 page leading to...
et cetera. The [NS] flag has no effect.
The user does receive a 410 response-- but only at the cost of a server error due to maxing out on internal redirects. This strikes me as a pretty brutal way to prevent someone from seeing the 410 page.
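A minimal reproduction of that loop, with a hypothetical file name, for anyone who wants to see the moving parts:

```apache
ErrorDocument 410 /410.html
RewriteEngine On
# Intent: refuse direct requests for the error document itself.
RewriteRule ^410\.html$ - [G]
# Reality: [G] invokes the 410 handler, which makes an internal
# request for /410.html, which re-runs this htaccess file and
# matches the rule again... until the internal-redirect limit
# is hit and the server gives up with a 500.
```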
The Adventure Continues...
Two weeks since the move. We're now in the part where the search engines figure out what the ### is going on, since it wasn't an across-the-board change. I did think about using the Address Change feature and then letting them figure out for themselves that two directories aren't there any more. But at best it would cause confusion and at worst they'd think I was up to some hanky-panky, so I decided not to.
Observations on human behavior:
-- a really enormous proportion of image traffic is to one of the directories that stayed behind. I never noticed this before.
-- a fair number of people have pages in a particular subdirectory bookmarked. Somehow Firefox's favicon reloader knows which site to ask, even when it has never been redirected. I do not pretend to understand this.
Observations on crawling, gleaned from tracking redirects:
-- as expected, some ebooks hardly ever get crawled. Static text, so once it's there, it never changes. This is especially true of the larger files. The googlebot does seem to be entranced by the General Index to one six-volume collection. Best guess: It's because the file has 19,000 links (some of which had to be hand-checked ;)) to other files in the same directory. I did say it's an index. Humans understand this, but the robot needs to keep checking to see whether any of those 19,000 links lead to other sites. (I could no-index it. There's no unique content, just links, so I can't imagine a human searching for it "cold". But I don't think it would reduce crawling.)
-- I had no idea the major search engines had so little interest in my gallery pages. Some have yet to be re-crawled. (I'm talking here about requests to the old site. The new site was fully crawled right away, at least by google.) They have almost no text, and nothing ever changes except added thumbnails and a word or two of new links. Cursory searching suggests that the googlebot's minimal rate is about once a month.
-- The Russian robots (Yandex and mail.ru) seem especially interested in the /paintings/ directory. Makes sense; you don't need to know English to look at pictures. Aside: I've toggled mail.ru on and off over the years. Currently they're allowed to get pages but not images. But I never realized how often Yandex crawls certain pages that by its own admission it isn't able to index. Are they hoping that one day the pages will magically change to English?
-- about a week after the move, the major search engines went wild with excitement as they discovered that a particular subdirectory was no longer roboted-out. Excitement was short-lived; the pages redirect to the other site, where the directory is roboted-out.
Corollary discovery: As I'd suspected, "nofollow" doesn't mean "pretend you haven't seen this link". It just means "don't tell them I sent you". Some of those newly accessible pages are old ones, from before the subdirectory was roboted-out, or even before it existed. (Can't remember if I made both changes at the same time.) But at least one requested page is newer; the only way the search engines can have learned of its existence is by a nofollow link.
Curious detail that I just thought of: They never asked for the directory-index page at the old site, though search engines normally do this eventually. Maybe they only ask if they've successfully crawled elsewhere in the directory? They did ask-- usefully-- for a page called "contents.html" that dates back to an earlier directory structure. It never occurred to me to redirect it; I have now done so.
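For the record, a stray legacy URL like that only needs a one-liner (paths here are made-up stand-ins):

```apache
# mod_alias: point one leftover page at its current home.
Redirect 301 /olddir/contents.html http://example.org/newdir/
```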
Things I've learned from Webmaster Tools
-- according to google, about half of the new site (108 pages) is now indexed. This number is figured weekly; the previous week it was two. I noticed when one popular page got re-indexed, 5 days after the move. Others started trickling in two days later. Visits from search engines are currently a mix of direct and redirect.
Also according to gwt, no pages (none, 0, zero) on the new site have been crawled. Possibly the different branches of google's computer had a quarrel and are now not speaking.
-- so far, the keyword list at the old site hasn't changed. Apart from some obvious ebook giveaways, I can tell because "rat" is not yet back in the first 10.
Via This Intermediate Link
A while back, in a different context, I posted under the title "via this intermediate figment of the imagination". Currently wmt is very, very confused in the "who links to your site" area. Each site is listed with scores of links from the other one. (There are actually about ten in each direction-- internal links that became external due to the split.) Every single one has the appended "via this intermediate link" line. For pages on example.com, the "intermediate link" is given as example.org/name-of-page. For pages on example.org, the "intermediate link" is example.com/name-of-page. Yes, in both directions, although redirects only go from example.com (old site) to example.org (new site).
Eventually they'll figure it out.
This phenomenon is unique to google. As far as I know, bing doesn't do "intermediate links".
-- along the way, I picked up a detail about Preview: their default font is sans-serif. Anywhere that your page doesn't specify a font or family, you'll see it in sans-serif. If you do specify, they'll use what you name. They'll even show embedded fonts. I think that's why my non-Roman text displays as empty boxes: all human browsers use font substitution, but the preview renderer doesn't.
As we pass Week Three:
-- Google Webmaster Tools has finally admitted to crawling the new site. The Index Status area of wmt gives weekly figures, not daily. Up until yesterday it claimed to have crawled zero pages ever-- even while the Total Indexed number continued to climb. This would seem to be, well, impossible. (Yes, I realize it isn't technically impossible. But I just don't have that kind of backlink profile :))
Finally today they changed all numbers ... retroactively, going back two weeks. "What do you mean we said we'd never crawled anything? We did it all weeks ago!" The "ever crawled" figure is now identical for January 5-12-19. I suppose it's the total pagecount of the whole site, which they crawled way back on the day of the move.
Counting on fingers suggests that there are currently about 100 dually-indexed pages (the difference between new site's increase and old site's decrease).
Minor oddity: At the old site, the "ever crawled" number jumped by about 20, as did the "blocked by robots.txt" total, at a time when I definitely wasn't adding any pages. I think this has to do with the directory that I unblocked because it's no longer there. The "ever crawled" number goes up because they dutifully put in requests for pages (there are really about 40, but search engines have no way of knowing about the others); the "blocked" number goes up by the same amount because the attempted crawl leads to a robots.txt exclusion at the new site.
The 60-odd pages removed from old site's index are-- again, making my best guess-- the pages that google finds most interesting and therefore crawls most often. With a handful of exceptions, I have no idea which ones they are :(
-- I made a belated but happy discovery. Bing, unlike google, has a directory-scope Site Move option. In addition to telling them that an entire domain has moved elsewhere, you can also tell them that example.com/directory/ has moved to example.com/otherdirectory/ or, in my case, to example.org/directory/. I can't imagine why google doesn't steal this idea; it's a good one.
-- Elsewhere in Bing territory: I discovered while looking up something else that I was wrong about how they found my custom 410 page. It had nothing to do with the site move; they've been asking for it by name since last June. Of course at this late date I have no idea what I did in June to bring the page to light; they did land on two different 410s in the days before first requesting the document. But if they hadn't known about the page before, they would eventually have learned about it if I hadn't fixed the site-redirect code. So thanks, Bingbot!
The page now gets an explicit [R=404,L] response. It looks droll, but is the only practical approach.
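One way to wire that up, combining the [R=404,L] response with the THE_REQUEST check suggested upthread (file name hypothetical):

```apache
ErrorDocument 410 /410.html
RewriteEngine On
# THE_REQUEST holds the client's original request line, so this
# condition matches only explicit requests for the document, not
# the server's internal fetch of it after some other page's 410.
RewriteCond %{THE_REQUEST} \s/410\.html[\s?]
RewriteRule ^410\.html$ - [R=404,L]
```

The internal fetch sails through untouched, so legitimate 410s still get the custom page, while anyone asking for the document by name gets a clean 404 instead of a server error.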
-- The "via this intermediate link" numbers continue to rise. Sigh.
Lucy -- This is a bit hard for me to explain, and maybe you answered it somewhere already since I don't know enough to follow some of the details of your description, but I would like to ask about the old internal links between pages that weren't moved and pages that were moved. It seems to me that there are three possibilities for what could be done with them:
1. You could change them on the old unmoved part of the site so that they point directly to the corresponding pages on the new site.
2. You could leave them as they were and let the new redirects handle the transfer.
3. You could just delete them, thereby breaking the old connections between the unmoved pages and the moved pages.
Well, I hope that makes sense, and if so, can you please tell me what you did.
|I would like to ask about the old internal links between pages that weren't moved and pages that were moved |
In fact it's something I looked into very thoroughly, aided by a text editor with multi-file searching.
FOR links to
... and vice versa. All hits were changed to
And then I did some manual editing. A few cross-links were only there because, hey, it's the same site, this might amuse the reader; once they're on different sites there's no longer any point. Which is why the phrase "irretrievably lost" on the Xanadu page no longer links to the discussion of the Inuktitut word ijagattuq.
:: detour to beginning of thread ::
Oh, yes, I did talk a bit about that. First and third posts (responding to Sandra's question). I also added some blahblah on the respective sites' Contact pages for the benefit of former visitors looking around blankly "Uh... didn't the couch used to be against the other wall?" For the first couple of weeks I had the same thing on the old site's front page, mainly to fill space.
As long as we're here:
I've been checking google's wmt every week for development on the "via this intermediate link" front. As of last week they rinse clean on the old site: the only reported links from new to old are the bona fide direct named links. The new site still lists over 100 links from the old site-- all of them naming a page (generally a moved one) on the old site as the "intermediate link". Those are the pages that don't get crawled very often, either because they're no-indexed or because they simply don't change (ebooks mainly). The actual count is, I think, two excluding error documents. ("Looking for something? It moved to example.org".)
They're now down to 56 pages indexed on the old site. (The correct count will be closer to 10.) "Content keywords" still haven't caught up; they're basically identical on both sites. I expect this to take a long time.
Way back before I even made the move, I said hypothetically:
|the new site's htaccess will have redirects in the opposite direction for the non-moved directories, in case something goes wrong and someone shows up in the wrong place. Or, ahem, if search engines start firing off requests at random. |
Would anyone care to guess what happens when you tell bing that certain directories are moving?
Oh, hiya, bingdude, didn't see you back there.
Yup: Every page that has ever been redirected or received a 404 or 410 will now be requested on the new site-- where, of course, it never existed. Honest, bing, it's gone. It wasn't just biding its time until I got a new domain name.
Thanks for your reply Lucy. I figured that you had already given some careful thought to this.
The aspect I'm wondering about is the amount of interlinking between what are now two independent websites. If you ended up with a lot of interlinking, it might not look natural to Google's algorithm. I know you just said that you got rid of some of the old interlinking, but I'm thinking that it might be best to eliminate all of it that isn't essential.
Anyway that's why I asked the question.