Use speech recognition in Gambas
Posted
#1
(In Topic #1018)
Enthusiast

I am trying to start a project. It involves voice recognition, but only for something very brief, probably just two words. My questions are:
1. Does this possibility exist?
2. If it does, is it possible to get the recognized text into an application written in Gambas?
I encourage you to guide me in this challenge.
Thank you.
Note:
======
I have:
Debian as the operating system.
More:
It is possible that I have not explained myself well.
What I want is this:
1. A user says two words into a microphone.
2. Those two words are picked up by a free-software speech recognizer (I don't yet know which one it will be).
3. That library converts the speech to text.
4. This is exactly what I want to do: retrieve the text and compare it with the orders I am going to give to the system from Gambas.
So I need:
1. To know which speech recognizer that converts microphone audio to text Gambas can use, or at least how to use such a recognizer and get its text result into Gambas.
I hope my idea is clear now.
Unfortunately for you, I am Spanish and I only know Spanish; please be patient with me. Thank you.
Posted
Regular

Europaeus sum !
Amare memorentes atque deflentes ad mortem silenter labimur.
Posted
Enthusiast

I have seen this:
Vosk Speech Recognition Toolkit
But to be honest I have no idea how to talk to Vosk and then use it from Gambas, because in the end all I want is this:
A person says something to a computer; Vosk, for example, converts what the person says into text; then I take that text and, if it matches what I expect in a comparison that I will do in Gambas, I execute an order so that another order, sent over the network, fulfils someone's wish somewhere else.
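In Gambas, the comparison step I have in mind might look roughly like this (just a rough sketch; the transcriber command and the expected phrase are invented for illustration):
Code (gambas)
Dim sHeard As String

' Run some external speech-to-text tool on a recorded phrase and
' capture its text output (the command name here is hypothetical).
Shell "some-vosk-transcriber /tmp/phrase.wav" To sHeard

' Compare the recognized text with the expected order.
If Trim$(LCase$(sHeard)) = "open door" Then
  ' Here I would send the real order to the other machine over the network.
  Print "Order accepted: open door"
Endif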
For your misfortunes I am Spanish and I only know Spanish, please, be patient with me, Thank you.
Posted
Regular

gambafeliz said
But to be honest I have no idea how to talk to Vosk and then use it from Gambas.
Well, I found this code written in C:
https://github.com/alphacep/vosk-api/blob/master/c/test_vosk_speaker.c
I don't know if it's suitable; it would seem so.
Note that this code does not translate speech to text directly from the microphone; it uses a "wav" audio file in which the speech has been recorded beforehand.
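If you need to produce such a file from the microphone first, one simple way could be to call the ALSA "arecord" utility from Gambas (assuming it is installed; the path and duration below are just examples):
Code (gambas)
' Records 3 seconds of 16 kHz, 16-bit mono audio to a WAV file,
' assuming the ALSA "arecord" utility is installed.
Shell "arecord -q -f S16_LE -r 16000 -c 1 -d 3 /tmp/phrase.wav" Wait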
I haven't installed the Vosk library; nevertheless, I tried to translate it into Gambas using the external functions of the Vosk API:
https://github.com/alphacep/vosk-api/blob/master/src/vosk_api.h
I should also point out that, since I haven't installed the Vosk resource, I obviously couldn't test my code.
Code (gambas)
- Library "libvosk..."
- ' VoskModel *vosk_model_new(const char *model_path)
- ' Loads model data from the file and returns the model object.
- ' VoskSpkModel *vosk_spk_model_new(const char *model_path)
- ' Loads speaker model data from the file and returns the model object.
- ' VoskRecognizer *vosk_recognizer_new_spk(VoskModel *model, float sample_rate, VoskSpkModel *spk_model)
- ' Creates the recognizer object with speaker recognition.
- ' int vosk_recognizer_accept_waveform(VoskRecognizer *recognizer, const char *data, int length)
- ' Accept voice data
- ' const char *vosk_recognizer_result(VoskRecognizer *recognizer)
- ' Returns speech recognition result.
- ' const char *vosk_recognizer_partial_result(VoskRecognizer *recognizer)
- ' Returns partial speech recognition.
- ' const char *vosk_recognizer_final_result(VoskRecognizer *recognizer)
- ' Returns speech recognition result. It doesn't wait for silence.
- ' void vosk_recognizer_free(VoskRecognizer *recognizer)
- ' Releases recognizer object.
- ' void vosk_spk_model_free(VoskSpkModel *model)
- ' Releases the model memory.
- ' void vosk_model_free(VoskModel *model)
- ' Releases the model memory.
- Library "libc:6"
- ' FILE *fopen (const char *__restrict __filename, const char *__restrict __modes)
- ' Open a file and create a new stream for it.
- ' int fseek(FILE *__stream, long int __off, int __whence)
- ' Seek to a certain position on STREAM.
- ' int feof (FILE *__stream)
- ' Return the EOF indicator for STREAM.
- ' size_t fread(void *__restrict __ptr, size_t __size, size_t __n, FILE *__restrict __stream)
- ' Read chunks of generic data from STREAM.
- ' int fclose (FILE *__stream)
- ' Close STREAM.
- model = vosk_model_new("model")
- spk_model = vosk_spk_model_new("spk-model")
- recognizer = vosk_recognizer_new_spk(model, 16000.0, spk_model)
- wavin = fopen("/path/of/file.wav", "rb")
- fseek(wavin, 44, SEEK_SET)
- nread = fread(buf, 1, buf.Count, wavin)
- final = vosk_recognizer_accept_waveform(recognizer, buf, nread)
- If final
- fclose(wavin)
- vosk_recognizer_free(recognizer)
- vosk_spk_model_free(spk_model)
- vosk_model_free(model)
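Note that, in a real program, one should also check that fopen() and the various vosk_*_new() functions return non-null pointers before using them; and keep in mind that the recognizer results arrive as small JSON strings, from which the "text" field still has to be extracted.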
Europaeus sum !
Amare memorentes atque deflentes ad mortem silenter labimur.
Posted
Guru

Posted
Enthusiast

As always, both of you to the rescue. I will try to use what you suggest and see if I am able to get the idea started.
Unfortunately for you, I am Spanish and I only know Spanish; please be patient with me. Thank you.
Posted
Regular

To be frank, in general these tools are still largely useless: the accuracy is generally very poor.
I assume that you want a utility that doesn't require training to be done by the speaker; in other words, you want to use a default model provided by the utility. Now, most of these are "english as she is spoke by Amer-kans", which is to be expected. (I am a "Strine", which has much nicer phonemes, by the way!)
After repeating the following input into the microphone a dozen times, I gave up trying to say the phrase the same way without mistakes and finally installed an audio recorder; I used gnome-audio-recorder by Osmo Antero, just for convenience's sake. It's pretty rudimentary but does the job. This is the input phrase:
"There was movement at the station for the word had passed around,
that the colt from 'Old Regret' had got away."
I'll skip the dozen or so other attempts that delivered nothing but garbage like "ten wars moon men" and report only the two recognizers that stood out.
pocketsphinx
PRO: fast
CON: medium accuracy for untrained models
RESULT: there was movement at the station the word that caused the rare that the call from all regret had gone already
vosk
PRO: much better untrained accuracy than any other I tried
CON: very slow at first, as it has to generate its default model, but it speeds up as long as you don't reboot.
RESULT: there was movement at the station for the word had passed around that the cult from all regret had got away
Both do have APIs that can be used, as Vuott says. I haven't looked at them closely yet, apart from noting that the pocketsphinx API looks a lot simpler than the Vosk one, which is v e r y complex (but possibly worth the effort).
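For a quick test without touching either C API, you could also just shell out to the command-line tool and capture its output; a sketch, assuming the older pocketsphinx package that ships the "pocketsphinx_continuous" binary:
Code (gambas)
' Runs pocketsphinx on a recorded file and captures the recognized text.
' Assumes the older pocketsphinx package providing "pocketsphinx_continuous".
Dim sText As String
Shell "pocketsphinx_continuous -infile /tmp/phrase.wav 2>/dev/null" To sText
Print sText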
Looking forward to your results!
p.s. I tried to attach the input mp3 file I used, but it appears that phpBB has never heard of audio files :-)
Posted
Enthusiast

I see that you have also found something useful.
Let me say that I don't want a conversation recognizer.
I want the user to say something that he sees on the screen, for example "4A", and that is enough for me to interpret it as a command.
I'm telling you this so you know exactly what I'm looking for.
Let's say it's this:
1. I present some codes on the screen, at will.
2. The user chooses one with his voice.
3. I get the choice converted to text in Gambas.
And finally I execute a command programmed for the string obtained.
Unfortunately for you, I am Spanish and I only know Spanish; please be patient with me. Thank you.
Posted
Regular

Regardless of the speech-to-text library you employ, you will need to "convert" the string it "heard" into something your program knows. For example, suppose the user picks "4X": with English models (and an English speaker) you are likely to end up with something like "four eggs" or "for eggs". So you need to teach your program, not the S2T library, that "four eggs" and "for eggs" are possible renderings of the "4X" command.
Now, given your location, I raise the question: what language(s) are your users going to use? Are dialects going to be a problem? And so on.
So, to get you moving, I think you will need some sort of lookup table to convert the text as delivered by user X in language Y with dialect Z into the required command; a sketch follows below.
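Something like a Gambas Collection would do (the spoken variants below are invented for illustration):
Code (gambas)
' Minimal sketch of such a lookup table, mapping what the S2T engine
' is likely to deliver to the command actually meant.
Dim cOrders As New Collection
Dim sHeard As String = "For Eggs"   ' example of what the engine might deliver

cOrders["four eggs"] = "4X"
cOrders["for eggs"] = "4X"
cOrders["four a"] = "4A"
cOrders["for a"] = "4A"

sHeard = Trim$(LCase$(sHeard))   ' normalize the delivered text
If cOrders.Exist(sHeard) Then
  Print "Command: "; cOrders[sHeard]
Else
  Print "Not understood: "; sHeard
Endif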
b