The 100k limit, as far as I know, only applies to how much of a page Google stores in its cache. Several sources have confirmed that the content beyond 100k on such pages is still indexed, even though it isn't cached.
That said, it's obviously a huge usability blunder to have a page that size, and very bad practice.
You mentioned that the best way to hide links from Google would be to block cgi-bin through robots.txt and then use a script which generates URLs of the format
where the id represents a URL.
I have two questions:
1) How do I implement this? Could you give me links to such scripts, as I've been unable to find any?
2) What happens, in general, when Google sees a link but cannot follow it? Do such links contribute to PR leak?
Suppose I have 10 links on my page, of which 4 are hidden via such a script. Would Google distribute the PR across all 10 links, or only across the 6 visible ones?
I did not pick this one myself, but it seems to do the job just as well as others I've seen. Just look at "-Shortened URLs" in the "readme.txt" file for guidance on using IDs.
2) PR leak.
First, there's an understanding that a PR leak is not really a leak: some PR is passed on from your page to the pages you link to, but your own page does not lose any PR in doing so. I've seen this statement in a few threads; here are two examples:
A link like that would point to an internal, access-restricted part of your own site. I would assume this makes it quite impossible to decide where the 4/10 of the PR should go. In order to "leak", it must leak to "somewhere", at least that is the way I understand it.
So, your 4 links will not leak. That is, they will not pass any PR on, as there is nothing to pass it on to. Each of your six remaining links will pass on 1/6 of the total amount passed on. The total amount "passed on" stays the same; it just benefits 6 pages instead of 10. As 1/6 is greater than 1/10, this is good for the pages that do get some.
<edit>typos, clarified text</edit>
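The arithmetic above can be sketched quickly. This is only a toy illustration of the simplified PageRank model being discussed; the 0.85 damping factor is an assumption, as the thread doesn't specify one:

```python
# Toy sketch of the "6 links vs 10 links" argument above, using the
# classic simplified PageRank model: a page passes on d * PR, split
# evenly among its crawlable outgoing links. d = 0.85 is assumed.

def pr_passed_per_link(page_pr, crawlable_links, d=0.85):
    """PR contributed to EACH page the crawlable links point to."""
    return d * page_pr / crawlable_links

pr = 1.0
# All 10 links crawlable: each target gets 1/10 of the passed-on amount.
share_10 = pr_passed_per_link(pr, 10)   # 0.085
# 4 links hidden behind a robots.txt-blocked script: only 6 remain.
share_6 = pr_passed_per_link(pr, 6)     # larger share per page
print(share_10, share_6)
```

Note the total passed on (0.85 here) is identical in both cases; hiding 4 links just concentrates it on the 6 visible targets.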
Everybody: may I remind you all that in post 23 I explained the difference between Java and JavaScript. Please call things by their proper names before I join Made_In_Sheffield's self-destructive behaviour (my walls won't last much longer, so I'll have to start breaking my neighbours'!)
claus: You said that a link can be hidden in a Java applet, but I'm not sure about that: is linking outside the applet's home domain even allowed under Java security policies?
Imaster: I'll give you a quick definition of PR leak:
A page gives PR to the pages it links to without losing any of its own. If the other page links back, it gives more PR to the first page. If your first page has more outgoing links, each one carries less PR, so the pages that link back return less. A page never loses PR by linking out, but it may end up gaining less than before: this is what is called PR leak.
Google definitely has followed the java links on my site. I have text links on the bottoms of the pages now, but Google indexed the pages before the text links were there.
The weird thing is, I have a client's site that uses the same type of Java scripting, and the Goo didn't make it past the first page the first time it got crawled.
GoogleBot might have visited the pages in question if you visited them yourself with the Google Toolbar turned on and set to send data to Google.
That's how Google finds new sites/pages...
If a page is not linked to by another page, then it is not part of the web as Google sees it, and by my understanding it will therefore not be added to the index. I think the only way Google finds pages is by following links from other sites. I'm pretty sure that submitting your site to Google does no good unless you have a link from another site (which makes submission a fairly pointless exercise, apart from making you feel better). Anyone want to correct me on any of that?
I do however think Alexa does find sites in this way.
Nobody's saying Google can't find a page without links. What I am saying is that they won't add it to the index until they find some links.
This is because of the way the PageRank algorithm works.
Why do you think pages get dropped from the index once you take them out of your site navigation but leave the pages there? Because they are no longer linked.
It takes a while, but it does happen in time.
Without looking in your log file to see what the user agent was, you can't blame that on Google. Do you have the Alexa toolbar installed?
This question about hiding links from Google was discussed some months ago, and Googleguy made his position on the topic clear:
Googleguy comments about hidden links [webmasterworld.com]
His comments are on page 1, fifth comment.
Personally, I use a PHP redirect script hidden from the bot using robots.txt. Google is gracious enough not to visit pages disallowed by robots.txt.
It's great for hiding repeated links to terms and conditions, privacy statements or adverts.
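For illustration, the robots.txt entry would look something like this (the paths are assumptions for the example, not taken from the thread):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /redirect.php
```

Any link pointing into a disallowed path is a dead end as far as Googlebot is concerned: it can see the URL but won't fetch it.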
Read what Googleguy said, hope this helps.
"4. Use a redirect script and use robots.txt to disallow your script. (RECOMMENDED!) googlebot won't even touch your script."
then GoogleGuy replied:
"Yah, robots.txt or meta tags are your friends.. Those should both work fine."
If you just want your users to find the links, you could put them on another page and set up a prominent link to that page.
The code is:
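The original snippet wasn't preserved in this archive. As a minimal sketch of the id-to-URL redirect discussed earlier in the thread (a hypothetical stand-in, not the script the poster actually used; the URL table and fallback are invented for the example):

```python
# Hypothetical sketch of an id-based redirect of the kind described in
# this thread. It would live under a robots.txt-disallowed path, so
# Googlebot never follows the links that point at it.

# Each short numeric id maps to a real destination URL.
URLS = {
    "1": "http://www.example.com/terms.html",
    "2": "http://www.example.com/privacy.html",
}

def redirect_location(query_id):
    """Return the Location header value for a given ?id= parameter."""
    target = URLS.get(query_id)
    if target is None:
        # Unknown id: fall back to the home page instead of erroring.
        return "http://www.example.com/"
    return target

# In a real CGI script you would emit the header, e.g.:
# print("Location: %s\n" % redirect_location(form.getvalue("id")))
```

The same idea works in PHP with `header("Location: ...")`; the essential part is that the script's path is disallowed in robots.txt.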
Think of it this way: forget HTML and consider the page to be simply text. Finding absolute URLs in that text is easy, and if the search engine spiders wish to, they can follow those URLs whether or not they are links.
Hope this helps,
I think Patrick_Taylor's Flash example would do the trick, though. I wonder how long that will last, since Google and Macromedia already have some kind of Flash-spidering arrangement going on.
There are a lot of cgi links pointing to my site. These are not read by spiders (drat, drat and double drat).
If links can be hidden from the search engines, I'm going to start exchanging links with any site, including link farms. I don't recall anyone that wanted to trade links ever asking if the link would be picked up by Google and the other search engines.
That won't help you with link farms, because they run bots that check whether you're linking back to them, and those bots won't see the link. They usually specifically require an <a href="URL">DESC</a> link.
I guess it's all guessing until we test it. But since people are reporting that Googlebot is fetching their .js files, Google must have some agenda with them.
i.e. the URL http://www.xyz.com/dir/s.asp?l=111 actually leads to abc.com, and if we check the backlinks for abc.com, the link http://www.xyz.com/dir/s.asp?l=111 does appear.
Examples used above are fictitious ;)