Conversation

Notices

Yukari Hafner :v_lesbian: (shinmera@mastodon.tymoon.eu)'s status on Saturday, 04-Jan-2025 08:30:34 JST

I'm off to bed now, but in case anyone has thoughts about this I'd be all ears:
Any ideas on how to do very simple human voice recognition? I just want to detect whether an audio stream is likely to be a voice or not, to improve the accuracy over a simple volume based approach that most chat things use.
The best I've come up with is checking the largest frequency bin and whether it lies in a normal vocal range (100-8k Hz), but that seems like it'd also have lots of false positives.
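
A minimal sketch of the check described above, assuming mono float samples and a naive DFT in place of a real FFT library. The function names, the 16 kHz sample rate, and the test tones are illustrative assumptions, not anything from the thread.

```python
import cmath
import math

def dominant_frequency(samples, sample_rate):
    """Return the frequency in Hz of the strongest non-DC bin of a naive DFT."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip DC, stay below the Nyquist bin
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n

def looks_like_voice(samples, sample_rate, low_hz=100.0, high_hz=8000.0):
    """True if the loudest bin falls inside the assumed vocal range."""
    return low_hz <= dominant_frequency(samples, sample_rate) <= high_hz

def tone(freq_hz, sample_rate, n):
    """Hypothetical helper: a pure sine test tone."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

# A 1 kHz beep is not speech, yet it passes the range check.
print(looks_like_voice(tone(1000.0, 16000, 256), 16000))
```

The last line illustrates the false-positive worry: any sound whose strongest component sits between 100 Hz and 8 kHz is accepted, whether or not it is a voice.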

- screwlisp (screwtape@mastodon.sdf.org)'s status on Saturday, 04-Jan-2025 08:30:33 JST
  in reply to
  
  @shinmera matched filter approach based on fragments of the person you expect to hear talking?
  
- screwlisp (screwtape@mastodon.sdf.org)'s status on Saturday, 04-Jan-2025 08:32:17 JST
  in reply to
  
  @shinmera https://en.wikipedia.org/wiki/Matched_filter
  Attachments
  1. Matched filter
    
    In signal processing, the output of the matched filter is given by correlating a known delayed signal, or template, with an unknown signal to detect the presence of the template in the unknown signal. This is equivalent to convolving the unknown signal with a conjugated time-reversed version of the template. The matched filter is the optimal linear filter for maximizing the signal-to-noise ratio (SNR) in the presence of additive stochastic noise. Matched filters are commonly used in radar, in which a known signal is sent out, and the reflected signal is examined for common elements of the out-going signal. Pulse compression is an example of matched filtering. It is so called because the impulse response is matched to input pulse signals. Two-dimensional matched filters are commonly used in image processing, e.g., to improve the SNR of X-ray observations. Additional applications of note are in seismology and gravitational-wave astronomy. Matched filtering is a demodulation technique with LTI (linear time invariant) filters to maximize SNR. It was originally also known as a North filter. Derivation...
- screwlisp (screwtape@mastodon.sdf.org)'s status on Saturday, 04-Jan-2025 08:37:14 JST
  in reply to
  
  @shinmera I guess you could base it on a range of people instead of one person, and it would work better on average and worse in any particular case. This is a normal receiver operating characteristic scenario, isn't it? There will be a lot of implementations of this sitting around I think. (produces none).
  
- Yukari Hafner :v_lesbian: (shinmera@mastodon.tymoon.eu)'s status on Saturday, 04-Jan-2025 08:37:15 JST
  in reply to
  - screwlisp
  @screwtape Hmm, yeah, I thought about similar stuff, but I really don't want to train on a specific voice or anything. I guess convolving with an inverse average voice frequency response and then checking deviation could work?
  
- screwlisp (screwtape@mastodon.sdf.org)'s status on Saturday, 04-Jan-2025 09:28:14 JST
  in reply to
  
  @shinmera oh, I'm aware people have done what you're saying in particular, but I've never watched such a thing. Basically you do a discrete convolution/correlation of two arrays, the test sample, and the kernel. We expect that bins in the result exceeding some sensitivity number you choose by trial and error are detections of the kernel in the test sample. You judge your quality by the True Positive Fraction and False Positive Fraction for your chosen sensitivity.
  
- Yukari Hafner :v_lesbian: (shinmera@mastodon.tymoon.eu)'s status on Saturday, 04-Jan-2025 09:28:15 JST
  in reply to
  - screwlisp
  @screwtape I have no idea, and in general signal processing theory stuff is extremely incompatible with my brain, so....
  Anyway, I just want to do something a bit smarter than the usual voice chat thing of thresholding by volume, with the hopes I'll prevent it from being triggered by random noises.
  Since I want to use it to drive the avatar's mouth open/close, having minor noise be treated as signal is far worse than in a voice chat app situation.
  
- screwlisp (screwtape@mastodon.sdf.org)'s status on Saturday, 04-Jan-2025 09:28:59 JST
  in reply to
  
  @shinmera I'll do a demo of a matched filter for the show on Wednesday, since I was building Fourier transform pipeline demos right now anyway.
  
- screwlisp (screwtape@mastodon.sdf.org)'s status on Saturday, 04-Jan-2025 09:36:03 JST
  in reply to
  
  @shinmera I would run a bank of filters and max the results. That's my feel for "any of multiple things are happening".
  
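The matched-filter idea from the replies (correlate a kernel with the test sample, threshold the result, score it by true and false positive fractions, and take the maximum over a bank of kernels) might look roughly like the sketch below. The kernels, the toy signal, and every function name are made-up assumptions for illustration. Real kernels would be fragments of recorded speech, and all kernels are assumed to have the same length.

```python
def correlate(signal, kernel):
    """Sliding dot product of the kernel against the signal (valid positions only)."""
    m = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(m))
            for i in range(len(signal) - m + 1)]

def bank_response(signal, kernels):
    """Run every kernel over the signal and keep the per-position maximum."""
    outputs = [correlate(signal, k) for k in kernels]
    return [max(values) for values in zip(*outputs)]

def detect(response, sensitivity):
    """Positions whose response exceeds the hand-tuned sensitivity count as detections."""
    return [r > sensitivity for r in response]

def rates(detections, truth):
    """Return (true positive fraction, false positive fraction)."""
    tp = sum(d and t for d, t in zip(detections, truth))
    fp = sum(d and not t for d, t in zip(detections, truth))
    positives = sum(truth)
    return tp / positives, fp / (len(truth) - positives)

# Toy data (assumption): two short kernels embedded in silence at positions 2 and 8.
kernels = [[1, -1, 1], [1, 1, -1]]
signal = [0, 0, 1, -1, 1, 0, 0, 0, 1, 1, -1, 0]
truth = [i in (2, 8) for i in range(len(signal) - 2)]

response = bank_response(signal, kernels)
print(rates(detect(response, 2.5), truth))  # strict threshold
print(rates(detect(response, 1.5), truth))  # looser threshold admits false positives
```

Sweeping the sensitivity and plotting the two fractions against each other gives the receiver operating characteristic curve mentioned earlier in the thread.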
