V1.8 is an experimental voice-first AI assistant. It listens to spoken prompts, responds out loud, can launch apps on macOS, can search the web for current context, can inspect images from a webcam, and can generate images when the user asks for a visual result.
Untitled.mov
This project combines several assistant behaviors into one loop:
- Speech-to-text using Whisper through
speech_recognition - Text-to-speech using
edge-tts - App launching on macOS for commands such as
open Safari - Web query simplification and search retrieval through DDGS
- Webcam-based vision analysis for appearance or environment questions
- Image generation with Stable Diffusion
- Terminal-friendly output formatting with
rich
This repository is intended for developers and hobbyists who want to explore a local voice assistant workflow. It is especially useful if you are interested in:
- voice interfaces
- multimodal AI interactions
- local automation on macOS
- image generation pipelines
- combining web search, vision, and speech in a single assistant
- macOS
- Python 3.11 or newer is recommended
- A microphone with system permission enabled
- A camera with system permission enabled if you want vision features
- Ollama installed and available on the machine running the script
chafainstalled if you want terminal previews for generated images- Hardware that can run the configured Stable Diffusion pipeline on
mps, or code changes to target a different device
The script uses the following Python packages:
torchollamaspeech_recognitionedge_ttsddgsopencv-pythonrichdiffusersterm-image
-
Clone the repository and open the
V1.8folder. -
Create a virtual environment:
python3 -m venv venv
source venv/bin/activate- Install the dependencies:
pip install torch ollama SpeechRecognition edge-tts ddgs opencv-python rich diffusers term-image- Make sure Ollama can access the models referenced in
main.py.
Run the assistant with:
python main.pyOn startup, the program asks you to choose a voice:
Maleselectsen-US-SteffanNeural- Any other input selects
en-US-AvaNeural
Then the assistant will:
- Prompt you to speak
- Transcribe your speech
- Decide whether the request is for app launching, image generation, vision analysis, or a normal answer
- Speak the response back to you
- Ask whether you want to continue the conversation
If the transcription includes open, the assistant tries to find a matching application on macOS. If no local app is found, it falls back to opening a website based on the target name.
For general questions, the assistant first simplifies the query and fetches recent search results. The response model can use those results when the request is about news, current events, or recent information.
If the prompt seems to require visual context, the assistant captures a frame from the webcam, saves it locally, and sends it to a vision-capable model for analysis.
If the prompt is recognized as an image request, the assistant converts it into a short image prompt, generates an image with Stable Diffusion, saves the result as generated_image.png, and displays it in the terminal.
- The current implementation is macOS-focused.
- The assistant depends on several external models and services.
- The Stable Diffusion pipeline is loaded at startup, which may be slow on lower-powered machines.
- The current code stores generated and captured images in the working directory.
- The app-launching behavior is intentionally simple and may not match every app name perfectly.
- The image preview in the terminal is pixelated and low quality.
This version leaves room for several improvements:
- Add cross-platform support beyond macOS
- Make the model names and device selection configurable through environment variables or a config file
- Add a proper command parser for app launching instead of relying on keyword matching
- Add a conversation history file or database
- Add streaming responses so users hear partial answers sooner
- Add a richer UI for desktop or web use
- Add safer image handling and cleanup for generated files
- Add a setup script or dependency file for easier installation
- If microphone input fails, check system permissions and verify
speech_recognitionis installed correctly. - If camera capture fails, check camera permissions and confirm OpenCV can access the device.
- If image generation fails, verify that your hardware supports the configured device target or update the pipeline configuration.
- If terminal image preview fails, install
chafaand confirm it is available in your PATH.
The MIT licence has been added for maximum freedom