
WordPress Forum

Where does GSiteCrawler get its urls from?

 5:49 pm on Apr 26, 2014 (gmt 0)

I have used GSiteCrawler (Sitemap) to check the URLs on my website, and for some reason it shows URLs (before the actual permalink) with the likes of ?p= at the end of the URL... (and the content at that URL is the same as without the extra characters)

I am wondering where GSiteCrawler gets these URLs... as I also have some of them indexed by Google...

Anyone any ideas?



 6:02 pm on Apr 26, 2014 (gmt 0)

Sounds like search results page URLs. What are your URL structure settings (permalink settings)? Are you using any plugins to help manage URLs, the kind usually called SEO plugins?


 6:10 pm on Apr 26, 2014 (gmt 0)

Just started using Yoast, having switched from AIOSEO, and using post name for permalinks...
The Posts title template is: %%title%% %%page%%
and having just seen that... it may be where the problem is...


 6:59 pm on Apr 26, 2014 (gmt 0)

Yoast lets you choose what gets indexed. Maybe take a look at your current sitemaps and make sure that you are only submitting each URL in one format, and that it is the format you want.
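
A minimal sketch of that sitemap check, in Python. It assumes the sitemap follows the standard sitemaps.org format and that the unwanted format is the ?p= style mentioned above; the function names and the sample sitemap are made up for illustration and are not part of Yoast or GSiteCrawler.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, parse_qs

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def urls_with_post_id(urls):
    """Return the URLs that carry a 'p' query parameter (the ?p= format)."""
    return [
        u for u in urls
        if "p" in parse_qs(urlparse(u).query, keep_blank_values=True)
    ]


# Hypothetical sitemap: one post listed in two formats.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://exampleurl.com/post-name/</loc></url>
  <url><loc>http://exampleurl.com/?p=123</loc></url>
</urlset>"""

listed = sitemap_urls(SAMPLE_SITEMAP)
unwanted = urls_with_post_id(listed)
```

If `unwanted` comes back empty for the real sitemap, the ?p= URLs are being picked up from somewhere other than the sitemap.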


 6:59 pm on Apr 26, 2014 (gmt 0)

I have found this quote from Google Groups....

"This does not match my experience. GSiteCrawler is coming up with URLs that do not currently exist on the site, they only exist in the Google index. (Those URLs did exist on the site say, 2 weeks ago, but no more.)"

This may explain the problem... but it seems odd that GSiteCrawler would take its URLs from Google's index rather than from the website itself...


 7:26 pm on Apr 26, 2014 (gmt 0)

Thing is, you don't need to use GSiteCrawler since you have an integrated sitemap generator that does what you tell it to. I would use those new sitemap results, maybe compare them to older lists just to be sure everything is listed as you prefer. If you have a GSiteCrawler plugin, it may be relying on some cached information.
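
Comparing a new list against an older one can be sketched as a set difference. This is only an illustration: the normalisation rules (lower-casing the host, ignoring a trailing slash) are assumptions about what should count as "the same URL", and the two lists are invented sample data.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Normalise a URL so trivially different spellings compare equal.

    Assumption: scheme and host are case-insensitive and a trailing
    slash on the path does not matter. The query string is kept,
    so ?p=123 stays distinct from the pretty permalink.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def compare_url_lists(old_urls, new_urls):
    """Return (only_in_old, only_in_new) as sorted lists of normalised URLs."""
    old = {normalize(u) for u in old_urls}
    new = {normalize(u) for u in new_urls}
    return sorted(old - new), sorted(new - old)


# Hypothetical data: an older crawl list and a freshly generated sitemap list.
old_list = ["http://exampleurl.com/post-name/", "http://exampleurl.com/?p=123"]
new_list = ["http://ExampleURL.com/post-name"]

stale, added = compare_url_lists(old_list, new_list)
```

Anything left in `stale` is a URL that an older list knew about but the current sitemap no longer includes, which is where cached or leftover entries would show up.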

Are your sitemaps still being generated by GSiteCrawler? I would switch over to letting the Yoast plugin handle them. You have much more control over what gets listed, and I would think it isn't good practice to use more than one plugin for the same function simultaneously.

I used to use the AIOSEO plugin, years ago, but switched to have control over noindex tags and not just list all the possible URLs.


 7:32 pm on Apr 26, 2014 (gmt 0)

I am only using GSiteCrawler at the moment to see if I can find what 'hidden' URLs might be in Google's index... I use the sitemap generator within WordPress.


 7:48 pm on Apr 26, 2014 (gmt 0)

As to where it gets the URLs, that partly depends on whether you are using it as a plugin or as a desktop app (and in that case, the version number can matter as well). As an app it is error-prone, relying on its own database and not necessarily cleaning up old information.


 7:54 pm on Apr 26, 2014 (gmt 0)

Sorry, I see you responded while I was typing. As a desktop app, it stores old URLs in its own database. For what Google has indexed, you might try using the site: operator in a search and compare that information. It sort of depends on whether you have dozens, hundreds or thousands of URLs to verify.


 8:15 pm on Apr 26, 2014 (gmt 0)

Thanks... I have used the 'site:' query but am struggling to find the rogue URLs that Google has in its index... what I really would like is a 'tool' that would show the URLs that Google has in its index... (I guess who wouldn't)


 3:38 am on Apr 27, 2014 (gmt 0)

Have you tried another site crawler to see if it discovers those URLs on a fresh crawl?
I would give Xenu's Link Sleuth a shot.
It will show you the pages linking to a crawled URL with the p parameter.


 10:26 am on Apr 27, 2014 (gmt 0)

I do use Xenu and that doesn't pick up the extra links...

Interestingly, GSiteCrawler picked up a post URL yesterday... (a WordPress-generated post URL etc.) and has shown in the URL list:

as well as exampleurl.com/post-name

and checking the exampleurl.com/post-name/feed in http status checker

'interesting' how GSiteCrawler gets its urls...


 2:13 pm on Apr 27, 2014 (gmt 0)

perhaps GSiteCrawler is also crawling the link elements in the html document <head>.


 3:27 pm on Apr 27, 2014 (gmt 0)

"perhaps GSiteCrawler is also crawling the link elements in the html document <head>. "

It does exactly that... I have removed the feeds from the head and re-run GSiteCrawler and the 'feed' urls have gone.
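
For anyone wanting to see what a crawler would find that way, here is a rough Python sketch that lists the link elements in a page's head. It is not how GSiteCrawler is implemented, just an illustration using the standard library. The sample HTML is invented; the rel="shortlink" entry with ?p= is an assumption based on what a default WordPress theme typically outputs, and may be worth checking against the real page source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HeadLinkCollector(HTMLParser):
    """Collect (rel, absolute href) pairs for <link> elements inside <head>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_head = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attr = dict(attrs)
            href = attr.get("href")
            if href:
                rel = attr.get("rel") or ""
                self.links.append((rel, urljoin(self.base_url, href)))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def head_links(html, base_url):
    """Return the <link> URLs a crawler could discover in the document head."""
    parser = HeadLinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page source for exampleurl.com/post-name
SAMPLE_HTML = """<html><head>
<title>Post name</title>
<link rel="alternate" type="application/rss+xml" href="/post-name/feed/" />
<link rel="shortlink" href="http://exampleurl.com/?p=123" />
<link rel="canonical" href="http://exampleurl.com/post-name/" />
</head><body><a href="/post-name/">Post</a></body></html>"""

found = head_links(SAMPLE_HTML, "http://exampleurl.com/post-name/")
```

On a page like the sample, `found` would include both the feed URL and the ?p= shortlink, neither of which appears as an ordinary link in the body. That would be consistent with the feed URLs disappearing once the feed links were removed from the head.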
