AI & Machine Learning | 2025 | Completed

AI Assistant M3 - Conversational RAG Agent with Voice I/O

Production-ready AI assistant with reasoning agent, voice input/output (OpenAI Whisper & TTS), RAG document Q&A, cloud storage, and 7 integrated tools. Features conversational context awareness, natural voice interaction with 6 voice personalities, lazy loading, and ChatGPT-style session management.


Technologies Used

Python, Streamlit, LangChain, RAG, OpenAI GPT-4, OpenAI Whisper, OpenAI TTS, Google Cloud Firestore, Google Cloud Platform, ChromaDB, OpenAI Embeddings, Tavily, WebRTC, audio-recorder-streamlit

Project Overview

🤖 Conversational RAG Agent with Voice I/O & Cloud Storage

Built a production-grade AI assistant featuring a reasoning agent, RAG-based document Q&A, conversational context awareness, natural voice interaction with OpenAI Whisper & TTS, and Google Cloud Firestore storage with ChatGPT-style session management.

🧠 Intelligent Agent System

  • Reasoning agent that autonomously selects appropriate tools based on context
  • Conversational context awareness - understands pronouns and follow-up questions
  • Two-level memory: agent history (last 10 messages) + RAG history (last 20 messages)
  • Custom enhanced prompts for optimal tool selection
  • GPT-4 integration with LangChain agent framework
  • Handles multi-tool queries intelligently
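
The two-level memory described above can be pictured as two bounded buffers that drop their oldest entries. This is a minimal sketch, not the project's code: the class `BoundedHistory` and its methods are hypothetical names, and only the limits (10 and 20 messages) come from the description.

```python
from collections import deque


class BoundedHistory:
    """Keeps only the most recent `limit` messages (hypothetical helper)."""

    def __init__(self, limit: int) -> None:
        # A deque with maxlen discards the oldest entry when it is full.
        self._messages = deque(maxlen=limit)

    def add(self, role: str, content: str) -> None:
        self._messages.append({"role": role, "content": content})

    def as_list(self) -> list:
        return list(self._messages)


# Limits taken from the description: 10 messages for the agent, 20 for RAG.
agent_history = BoundedHistory(limit=10)
rag_history = BoundedHistory(limit=20)

for i in range(25):
    agent_history.add("user", f"message {i}")
    rag_history.add("user", f"message {i}")
```

After 25 messages, the agent buffer holds messages 15 to 24 and the RAG buffer holds messages 5 to 24, so the RAG side can resolve references that the agent side has already forgotten.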

🎤 Voice Input/Output System (NEW!)

  • OpenAI Whisper integration for high-accuracy speech recognition (99+ languages)
  • Natural text-to-speech with 6 voice personalities (OpenAI TTS)
  • Auto-speak mode for hands-free conversational experience
  • Real-time audio transcription with browser WebRTC API
  • Voice-enabled document Q&A - speak questions, hear responses
  • Manual playback controls for any AI message
  • Base64 audio encoding with HTML5 autoplay integration
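
Base64 encoding with HTML5 autoplay, as listed above, might look roughly like the following. The function name `audio_to_html` and the default MIME type are assumptions made for illustration; the source does not show how the project builds its markup.

```python
import base64


def audio_to_html(audio_bytes: bytes, mime: str = "audio/mpeg", autoplay: bool = True) -> str:
    """Embed raw audio bytes in an HTML5 <audio> tag as a base64 data URI."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    autoplay_attr = " autoplay" if autoplay else ""
    return (
        f"<audio controls{autoplay_attr}>"
        f'<source src="data:{mime};base64,{encoded}" type="{mime}">'
        "</audio>"
    )


# Placeholder bytes stand in for real TTS output.
html = audio_to_html(b"fake-mp3-bytes")
```

In a Streamlit app, a string like this could be rendered with `st.markdown(html, unsafe_allow_html=True)`. Browsers may still block autoplay until the user has interacted with the page, which is one reason to keep manual playback controls.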

📄 RAG Document Q&A

  • Upload PDFs, Word docs, and text files for Q&A
  • ChromaDB vector store with OpenAI embeddings
  • Conversational RAG - reformulates questions using chat history
  • Document chunking with recursive text splitter (chunk size 1000, overlap 200)
  • Smart retrieval with top-3 similarity search
  • Supports follow-up questions like "summarize it", "tell me more"
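
To illustrate what chunking with overlap does, here is a simplified fixed-window splitter. It is not the recursive splitter the project uses (which also tries to break on paragraph and sentence boundaries), and reading "1000/200" as a chunk size of 1000 with an overlap of 200 is an assumption. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Split text into windows of `chunk_size` characters that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


document = "x" * 2500
chunks = chunk_text(document)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps the top-3 similarity search return usable context.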

🛠️ Seven Integrated Tools

  • 🔍 Web Search - Tavily API for current information
  • 🌤️ Weather - Real-time data via OpenWeatherMap
  • 💱 Currency Converter - Live exchange rates (50+ currencies)
  • 📊 Stock Prices - Current market data lookup (SerpAPI)
  • 🧮 Calculator - Safe mathematical expression evaluation
  • 📄 Document Q&A - RAG-based conversational queries
  • 🎤 Voice I/O - Speech recognition & natural TTS
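
One common way to get "safe mathematical expression evaluation" is to parse the expression into a syntax tree and allow only arithmetic nodes. The sketch below shows that approach under the assumption that the calculator handles basic arithmetic; the source does not say how the project implements it, and `safe_eval` is a hypothetical name.

```python
import ast
import operator
from typing import Union

# Whitelist of permitted operators; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> Union[int, float]:
    """Evaluate an arithmetic expression without calling eval()."""

    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access and so on all end up here.
        raise ValueError("unsupported expression")

    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)
```

Function calls, attribute access, and variable names raise `ValueError`, so input such as `__import__('os')` never executes. Exponentiation is left out of this sketch because very large powers can be expensive to compute; malformed input raises `SyntaxError` from `ast.parse`.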

☁️ Google Cloud Storage & Session Management

  • Google Cloud Firestore integration for persistent storage
  • Cloud-native architecture with service account authentication
  • ChatGPT-style UI with smart session titles (generated from first message)
  • Lazy loading - sessions load only when clicked (5-8x faster)
  • Auto-save functionality with session caching
  • Delete sessions with one click
  • Supports multiple conversations with seamless switching
  • Scalable cloud infrastructure supporting 100+ concurrent users
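
Generating a session title from the first message could be as simple as the sketch below. The truncation rule, the 40-character limit, the "New chat" fallback, and the name `make_session_title` are all assumptions; the source only states that titles come from the first message.

```python
def make_session_title(first_message: str, max_length: int = 40) -> str:
    """Derive a short sidebar title from the first user message (hypothetical rule)."""
    text = " ".join(first_message.split())  # collapse whitespace and newlines
    if not text:
        return "New chat"
    if len(text) <= max_length:
        return text
    # Cut at the last word boundary that fits, then mark the truncation.
    cut = text[:max_length].rsplit(" ", 1)[0]
    return cut + "..."


title = make_session_title("Explain the difference between lazy loading and eager loading")
```

Storing the result once, when the session is created, avoids recomputing it every time the session list is drawn.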

🏗️ Architecture & Design

  • Modular architecture: separate tools, agents, RAG, UI, voice, and utils packages
  • Tool decorator pattern for easy extensibility
  • Pydantic schemas for type-safe tool inputs
  • Session state management with Streamlit
  • Error handling and graceful API failure recovery
  • Configuration-driven design (easy to customize)
  • Separation of concerns - voice logic isolated from core chat
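
The tool decorator pattern can be illustrated with a plain-Python registry. This is a sketch of the general idea only: it is not LangChain's own tool decorator, and `TOOL_REGISTRY`, `register_tool`, `run_tool`, and the toy `convert` tool are hypothetical names.

```python
from typing import Callable, Dict

TOOL_REGISTRY: Dict[str, Callable] = {}


def register_tool(name: str) -> Callable:
    """Decorator that adds a function to the registry under `name`."""

    def decorator(func: Callable) -> Callable:
        if name in TOOL_REGISTRY:
            raise ValueError(f"tool {name!r} is already registered")
        TOOL_REGISTRY[name] = func
        return func

    return decorator


@register_tool("currency_converter")
def convert(amount: float, rate: float) -> float:
    """Toy tool: multiply an amount by an exchange rate supplied by the caller."""
    return round(amount * rate, 2)


def run_tool(name: str, **kwargs):
    """Look up a tool by name and call it with keyword arguments."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](**kwargs)
```

Adding a new tool then means writing one decorated function; nothing in the dispatch code has to change, which is where the extensibility comes from.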

⚡ Performance Optimizations

  • Lazy loading for 5-8x faster startup
  • Session metadata caching (no repeated Firestore queries)
  • Title storage at creation (not generated each time)
  • Optimized Firestore reads (metadata only, not full messages)
  • Agent caching with @st.cache_resource
  • Configurable auto-load for speed vs UX balance
  • Efficient audio encoding with temporary file cleanup
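
Lazy loading combined with metadata caching can be sketched as follows. `LazySessionStore` and `FakeBackend` are hypothetical; the fake backend is an in-memory stand-in that counts reads, so no database client or real Firestore call appears here.

```python
class LazySessionStore:
    """Loads session metadata once, and full messages only on demand."""

    def __init__(self, backend) -> None:
        self._backend = backend  # any object with list_metadata() and load_messages()
        self._metadata = None    # cached list of {"id", "title"} dicts
        self._messages = {}      # session_id -> cached message list

    def list_sessions(self):
        if self._metadata is None:  # only the first call hits the backend
            self._metadata = self._backend.list_metadata()
        return self._metadata

    def open_session(self, session_id):
        if session_id not in self._messages:  # fetch messages when first opened
            self._messages[session_id] = self._backend.load_messages(session_id)
        return self._messages[session_id]


class FakeBackend:
    """In-memory stand-in for a database that counts how often it is read."""

    def __init__(self) -> None:
        self.reads = 0
        self._data = {"s1": ["hi", "hello"], "s2": ["weather?"]}

    def list_metadata(self):
        self.reads += 1
        return [{"id": sid, "title": msgs[0]} for sid, msgs in self._data.items()]

    def load_messages(self, session_id):
        self.reads += 1
        return list(self._data[session_id])


backend = FakeBackend()
store = LazySessionStore(backend)
store.list_sessions()
store.list_sessions()     # served from the cache
store.open_session("s1")
store.open_session("s1")  # served from the cache
```

Four calls result in two backend reads. Session "s2" is never fetched because it was never opened, which is the effect the lazy-loading bullets describe.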

🎯 Voice Module Technical Details

  • Browser WebRTC MediaRecorder for real-time audio capture
  • OpenAI Whisper API with language hints for transcription accuracy
  • OpenAI TTS API with configurable models (tts-1/tts-1-hd)
  • Base64 audio encoding for browser-native playback
  • Temporary file management for secure audio processing
  • Session state for voice preferences (auto-speak, selected voice)
  • Cost-optimized with configurable quality settings
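
Temporary file management for audio can be handled with a context manager that always deletes the file, even when transcription fails. This is a minimal sketch using only the standard library; `temp_audio_file` is a hypothetical helper and the `.wav` suffix is an assumption.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_audio_file(audio_bytes: bytes, suffix: str = ".wav"):
    """Write audio bytes to a temporary file and delete it afterwards."""
    handle = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        handle.write(audio_bytes)
        handle.close()  # close so other code can reopen the path (needed on Windows)
        yield handle.name
    finally:
        if not handle.closed:
            handle.close()
        if os.path.exists(handle.name):
            os.remove(handle.name)  # cleanup runs even if the caller raised


# Placeholder bytes stand in for a real recording.
with temp_audio_file(b"RIFF-fake-audio") as path:
    existed_inside = os.path.exists(path)
    with open(path, "rb") as f:
        payload = f.read()
existed_after = os.path.exists(path)
```

Inside the `with` block, the path could be opened and passed to a speech-to-text call; once the block exits, the recording no longer exists on disk.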

🏆 Technical Achievements

  • Intelligent Agent: Built reasoning agent that autonomously selects and chains tools with conversational context awareness
  • Voice Integration: Implemented production-grade voice I/O system with OpenAI Whisper (speech recognition) and TTS (6 natural voices) for hands-free interaction
  • Conversational RAG: Implemented dual-memory system with question reformulation for natural document Q&A
  • Cloud Architecture: Integrated Google Cloud Firestore for production-grade data persistence with service account security
  • Production Design: Modular, scalable architecture with 7 tools, voice I/O, cloud storage, and ChatGPT-style UX
  • Performance: Achieved 5-8x faster load times through lazy loading and intelligent caching strategies

💡 Key Features

Reasoning Agent, Voice Input/Output, OpenAI Whisper & TTS, Conversational Context, RAG Document Q&A, Cloud Storage, ChatGPT-style UI, Lazy Loading, Multi-Tool Integration, Modular Architecture
