Actually!
Audio in general is kind of an outdated standard, meaning an image file (a spectrogram, i.e. a visual representation of an audio file) is superior to an audio file in a lot of ways. In terms of upscaling, far more research and development has gone into algorithms for upscaling images and video than into audio. While it's kind of subjective, one can argue that there is a lot more information in a still image than in an audio file (depending on file sizes and whatnot).
The problem with audio is that there has never been an official image format for it, so there is no standardization of image files for audio. Nearly all recording engineers record vocals as a WAV file rather than an image file, because there is no format for recording audio directly to an image. The format is old, and upscaling an old format like that wouldn't make sense, because you're working with outdated philosophies and ways of thinking. Upscaling can be done, but it makes more sense to do it on an image file than on a WAV file.
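
For what it's worth, here is a minimal sketch of what turning audio into a spectrogram image could look like. It assumes NumPy and a mono signal. The function name `spectrogram_image`, the frame size, the hop and the 80 dB display range are my own illustrative choices, not any standard format. It also keeps only the magnitude and throws away the phase, so the image by itself is not a lossless copy of the audio.

```python
import numpy as np

def spectrogram_image(samples, frame_size=256, hop=128, range_db=80.0):
    """Return a magnitude spectrogram as an 8-bit grayscale array (frequency x time).

    Hypothetical helper for illustration; all parameter defaults are assumptions.
    """
    window = np.hanning(frame_size)
    n_frames = 1 + (len(samples) - frame_size) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        samples[i * hop : i * hop + frame_size] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; only the magnitude is kept, the phase is discarded.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Convert to decibels and limit the range so quiet bins do not dominate.
    db = 20.0 * np.log10(magnitude + 1e-10)
    db = np.clip(db, db.max() - range_db, db.max())
    # Scale to 0..255 so the result can be stored as a grayscale image.
    pixels = (255.0 * (db - db.min()) / (db.max() - db.min())).astype(np.uint8)
    # Transpose to (frequency, time) and flip so low frequencies sit at the bottom.
    return pixels.T[::-1]

# Example: one second of a 1 kHz sine tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
image = spectrogram_image(tone)
```

The resulting array can be handed to any image library to save as a PNG, at which point image upscalers can be tried on it. Getting back to audio from the upscaled image would still need a phase estimate, which this sketch does not cover.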