1. Malformed HTML will cause some spiders to miss content (or misinterpret it, skip some links, etc.).
Unless you know precisely what effect those malformed tags have on the spiders you care about, it's best not to deploy them.
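To make the point concrete, here's a minimal sketch (mine, not any real engine's) of the kind of crude, high-speed link extractor a first-pass spider might use, and how small markup faults make links vanish from its output:

import re

# Hypothetical crude first-pass extractor: fast, but it only
# recognizes the exact markup shape it was written to expect.
HREF = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)

pages = [
    '<a href="/about.html">About</a>',   # well-formed
    '<a href=/about.html>About</a>',     # unquoted attribute
    '<a href="/about.html>About</a>',    # unclosed quote
]
for page in pages:
    print(HREF.findall(page))
# ['/about.html']   <- link found
# []                <- link silently skipped
# []                <- link silently skipped

A real spider's parser is far more forgiving than this, but the principle holds: whatever shortcuts the fast path takes, markup it doesn't expect falls through them.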
2. Indexing the content a spider has retrieved is a huge, high-speed operation. Broadly and crudely speaking, the stages are (sketched in code after this list):
a. parse it to find text and links
b1. pass the links and anchor text to the spider for further retrieval
b2. pass the text to the indexer for indexing.
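As an illustration of that fan-out, here's roughly how stages a, b1 and b2 could fit together, using Python's lenient stdlib parser (the queue names are mine, purely illustrative):

from collections import deque
from html.parser import HTMLParser

crawl_queue = deque()   # b1: links go back to the spider
index_queue = deque()   # b2: text goes to the indexer

class PageParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # a: pull links out as we parse
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    crawl_queue.append(value)   # link -> spider

    def handle_data(self, data):
        # a: pull text out as we parse
        text = data.strip()
        if text:
            index_queue.append(text)            # text -> indexer

PageParser().feed('<p>Hello <a href="/next.html">world</a></p>')
print(crawl_queue)   # deque(['/next.html'])
print(index_queue)   # deque(['Hello', 'world'])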
The first bottleneck is the parse, so there is likely to be more than one parser: a simple high-speed one that rips out as much content and as many links as it can, then a slower, more precise one that handles the cases the first-pass parser rejects.
That way, the spiders and indexers are being fed as fast as possible.
But a minority of pages (those that trip up the simple parser) get put on the back burner for later handling.
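If I had to guess at the shape of that dispatch, it might look something like this (a sketch under my assumptions above, not anything any engine has published):

from collections import deque

slow_lane = deque()   # pages the fast parser rejected ("back burner")

def fast_parse(html):
    # Hypothetical fast path: bail out on anything it can't handle cheaply.
    if html.count("<") != html.count(">"):      # crude well-formedness check
        raise ValueError("markup too messy for the fast path")
    return html   # ...rip out text and links here...

def handle_page(url, html):
    try:
        return fast_parse(html)                 # fed to the indexer immediately
    except ValueError:
        slow_lane.append((url, html))           # indexed later, by the precise parser
        return None

handle_page("https://example.com/ok", "<p>clean page</p>")
handle_page("https://example.com/messy", "<p>broken <a href=")
print(len(slow_lane))   # 1 -- the messy page waits its turn

The fast path stays hot because it never blocks on the hard cases; the cost is that the messy pages wait.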
So, in effect, a quality signal like malformed markup can lead to slower indexing.
Do I know that for sure? NO.
But if I were building the backend to a search engine's indexer, it would be the approach I'd take. Most other approaches would slow things down too much.
Would I want to take the risk that malformed tags lead to slower indexing? NO.