Descript seems to be a simple, stand-alone, text-driven NLE aimed at the Zoom crowd, with some integration with other pro NLEs. It looks interesting.
OTOH, our preferred NLE is FCP. Yes, it would be great if it had voice recognition, script integration, and free or very cheap good-quality voice-to-text. Two existing apps add that to FCP: Simon Says and Lumberjack Builder.
Fcp.co: Simon Says Assemble Released
Simon Says appears similar to Lumberjack Builder, which also does text-driven editing coupled with automated (or user-provided) transcription. Simon Says seems to have more collaborative features than Lumberjack: www.lumberjacksystem.com/apps/
Lumberjack uses your choice of third-party transcription service, or you can provide your own transcripts. It costs about $10 per month (not including transcription). Like Simon Says, it integrates with FCP and other NLEs.
High-quality, multi-speaker, "untrained" voice-to-text is difficult. However, the technology obviously exists: no human being is transcribing all those YouTube videos. Why it is not more widely available as a standard feature in pro NLEs is a good question. Personally, I would use it far more often than, say, 360-degree video editing.