AI & Machine Learning2025Completed
AI Assistant M3 - Conversational RAG Agent with Voice I/O
Production-ready AI assistant with reasoning agent, voice input/output (OpenAI Whisper & TTS), RAG document Q&A, cloud storage, and 7 integrated tools. Features conversational context awareness, natural voice interaction with 6 voice personalities, lazy loading, and ChatGPT-style session management.

Technologies Used
PythonStreamlitLangChainRAGOpenAI GPT-4OpenAI WhisperOpenAI TTSGoogle Cloud FirestoreGoogle Cloud PlatformChromaDBOpenAI EmbeddingsTavilyWebRTCaudio-recorder-streamlit
Project Overview
🤖 Conversational RAG Agent with Voice I/O & Cloud Storage
Built a production-grade AI assistant featuring an intelligent reasoning agent, RAG-based document Q&A, conversational context awareness, natural voice interaction with OpenAI Whisper & TTS, and Firebase cloud storage with ChatGPT-style session management.
🧠 Intelligent Agent System
- Reasoning agent that autonomously selects appropriate tools based on context
- Conversational context awareness - understands pronouns and follow-up questions
- Two-level memory: agent history (10 msgs) + RAG history (20 msgs)
- Custom enhanced prompts for optimal tool selection
- GPT-4 integration with LangChain agent framework
- Handles multi-tool queries intelligently
🎤 Voice Input/Output System (NEW!)
- OpenAI Whisper integration for high-accuracy speech recognition (99+ languages)
- Natural text-to-speech with 6 voice personalities (OpenAI TTS)
- Auto-speak mode for hands-free conversational experience
- Real-time audio transcription with browser WebRTC API
- Voice-enabled document Q&A - speak questions, hear responses
- Manual playback controls for any AI message
- Base64 audio encoding with HTML5 autoplay integration
📄 RAG Document Q&A
- Upload PDFs, Word docs, and text files for Q&A
- ChromaDB vector store with OpenAI embeddings
- Conversational RAG - reformulates questions using chat history
- Document chunking with recursive text splitter (1000/200)
- Smart retrieval with top-3 similarity search
- Supports follow-up questions like "summarize it", "tell me more"
🛠️ Seven Integrated Tools
- 🔍 Web Search - Tavily API for current information
- 🌤️ Weather - Real-time data via OpenWeatherMap
- 💱 Currency Converter - Live exchange rates (50+ currencies)
- 📊 Stock Prices - Current market data lookup (SerpAPI)
- 🧮 Calculator - Safe mathematical expression evaluation
- 📄 Document Q&A - RAG-based conversational queries
- 🎤 Voice I/O - Speech recognition & natural TTS
☁️ Google Cloud Storage & Session Management
- Google Cloud Firestore integration for persistent storage
- Cloud-native architecture with service account authentication
- ChatGPT-style UI with smart session titles (generated from first message)
- Lazy loading - sessions load only when clicked (5-8x faster)
- Auto-save functionality with session caching
- Delete sessions with one click
- Supports multiple conversations with seamless switching
- Scalable cloud infrastructure supporting 100+ concurrent users
🏗️ Architecture & Design
- Modular architecture: separate tools, agents, RAG, UI, voice, and utils packages
- Tool decorator pattern for easy extensibility
- Pydantic schemas for type-safe tool inputs
- Session state management with Streamlit
- Error handling and graceful API failure recovery
- Configuration-driven design (easy to customize)
- Separation of concerns - voice logic isolated from core chat
⚡ Performance Optimizations
- Lazy loading for 5-8x faster startup
- Session metadata caching (no repeated Firebase queries)
- Title storage at creation (not generated each time)
- Optimized Firestore reads (metadata only, not full messages)
- Agent caching with @st.cache_resource
- Configurable auto-load for speed vs UX balance
- Efficient audio encoding with temporary file cleanup
🎯 Voice Module Technical Details
- Browser WebRTC MediaRecorder for real-time audio capture
- OpenAI Whisper API with language hints for transcription accuracy
- OpenAI TTS API with configurable models (tts-1/tts-1-hd)
- Base64 audio encoding for browser-native playback
- Temporary file management for secure audio processing
- Session state for voice preferences (auto-speak, selected voice)
- Cost-optimized with configurable quality settings
🏆 Technical Achievements
- Intelligent Agent: Built reasoning agent that autonomously selects and chains tools with conversational context awareness
- Voice Integration: Implemented production-grade voice I/O system with OpenAI Whisper (speech recognition) and TTS (6 natural voices) for hands-free interaction
- Conversational RAG: Implemented dual-memory system with question reformulation for natural document Q&A
- Cloud Architecture: Integrated Google Cloud Firestore for production-grade data persistence with service account security
- Production Design: Modular, scalable architecture with 7 tools, voice I/O, cloud storage, and ChatGPT-style UX
- Performance: Achieved 5-8x faster load times through lazy loading and intelligent caching strategies
💡 Key Features
Reasoning AgentVoice Input/OutputOpenAI Whisper & TTSConversational ContextRAG Document Q&ACloud StorageChatGPT-style UILazy LoadingMulti-Tool IntegrationModular Architecture

