I noticed that I have hundreds of pages indexed that don't exist on my website. For instance,
They almost always contain spammy keywords for drugs, games, and the like. When I click on these pages, all I see are blank pages with empty code. When I check the server, these pages simply don't exist.
In addition, I have many other pages indexed with URLs like the following. When I click on them, I do end up on the archive, category, or post page, but these are duplicates, because ideally someone should be able to reach those pages directly:
My WordPress installation is fully up to date. I asked a question about this a while ago in another place, and the answer was: "Google will index any page that returns a 200 status code. If someone links to a non-existent page on your site and it returns a 200 status code, Google will likely index it."
Will this do it? What else do I need to do? How do I make sure that whoever is linking to these pages will not be able to achieve this result? Is there a way to test that the 404 page is in compliance? By the way, why would anyone try to link to these pages? I have noticed a lot of Russian and Polish spam websites linking to my plastic surgery website. Why would they do it? I thought getting links was difficult but these people probably mean harm.
You need to look from the other end. What happens when someone requests a pseudo-directory (I assume you're rewriting from displayed directory to actual query) or uses a bogus query term? What should not happen is that your site goes ahead and creates a page.
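One way to look from the other end is to request a URL you know is bogus and see which status code comes back. The following is only a rough sketch in Python using the standard library; the helper names fetch_status and verdict, the User-Agent string, and the example URL are all made up for illustration and are not part of WordPress or any Google tool.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow redirects, so the first status code stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url, timeout=10):
    """Return the status code of the first response for url (hypothetical helper)."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": "status-check"})
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 3xx (when not followed), 4xx and 5xx responses.
        return err.code


def verdict(status):
    """Describe what a status code means for a URL that should not exist."""
    if status in (404, 410):
        return "ok: reported as missing"
    if status in (301, 302, 303, 307, 308):
        return "redirect: check where it leads"
    if status == 200:
        return "problem: bogus URL answered 200 and may be indexed"
    return "unexpected status: investigate"


# Example call (placeholder URL, replace with a bogus path on your own site):
# print(verdict(fetch_status("https://example.com/no-such-page-xyz/")))
```

If a made-up path or query string comes back as 200, the site is building a page for it, and that is the thing to fix.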
A custom 404 page is very unlikely to be the problem. Error pages are for humans. Robots just note the 404 status and carry on. Unlike humans, they can even choose not to follow redirects.
Even in the Parameters area of Google Webmaster Tools (GWT) and similar tools, there's no way to say "Ignore everything except...." So you need to make sure that spurious parameters, or impossible values, aren't being processed in the first place.
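The "ignore everything except" rule has to live in your own request handling. Here is a minimal sketch of the idea, written in Python purely to show the logic; on a real WordPress site it would sit in PHP or in the server configuration. The ALLOWED table, the function name status_for_query, and the choice of p, cat and s as the accepted parameters are assumptions for the example, so substitute whatever your site actually uses.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumed allow-list: parameter name -> check that its value is plausible.
ALLOWED = {
    "p": str.isdigit,                   # numeric post ID
    "cat": str.isdigit,                 # numeric category ID
    "s": lambda v: 0 < len(v) <= 100,   # non-empty, bounded search term
}


def status_for_query(url):
    """Return 404 if the query has an unknown parameter or an impossible value.

    Only the query string is inspected; the path is not validated here.
    """
    query = urlsplit(url).query
    for name, value in parse_qsl(query, keep_blank_values=True):
        check = ALLOWED.get(name)
        if check is None or not check(value):
            return 404
    return 200


print(status_for_query("/?p=12"))          # known parameter, numeric value
print(status_for_query("/?cheap-pills=1")) # unknown parameter
```

Anything that is not on the list gets a 404 instead of a generated page, so there is nothing for a robot to index.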