Ever dreamed of having your own personal AI assistant that can use your computer like you do? With OmniParser V2 from Microsoft, that future is already here, and this guide will show you how to take your very first steps.
You don’t need to be a coder or tech expert. If you can follow simple instructions, you can build your first AI agent today.
✍️ About the Author
This article was written by Nuraj Shaminda, a tech blogger passionate about making AI tools accessible for everyone. With hands-on experience testing over 50 AI apps and models, Nuraj Shaminda specializes in beginner-friendly guides that empower creators, developers, and curious learners.
What Is an AI Agent?
An AI agent is like a smart digital assistant. It can look at your screen, understand what’s on it (like a button or a form), and take action , like clicking, typing, or filling out spreadsheets.
Thanks to Microsoft’s free and open-source tool OmniParser V2, your AI model (like ChatGPT or Claude) can now do all of that.
Tested by the author on a MacBook Air with GPT-4o, it works beautifully with real screenshots.
What You’ll Learn in This Guide
This beginner-friendly guide will walk you through:
- What tools you need
- How to install OmniParser V2 step by step
- How to launch your first demo
- What’s actually happening behind the scenes
- What to do next
All steps are based on Microsoft’s official blog post and GitHub repository.
What You’ll Need
Tool | Why It’s Needed |
---|---|
Anaconda (or Miniconda) | For managing Python tools |
Git | To download OmniParser’s code |
Python (3.12) | The programming language it runs on |
Hugging Face account | To get the AI model weights |
Terminal/Command Prompt | Where you’ll type the commands |
📌 No need to code — just follow each instruction step by step.
Step-by-Step: Build Your First AI Agent
Step 1: Install Anaconda (or Miniconda)
- Visit Anaconda Download Page
- Choose your operating system
- Install with default settings
Step 2: Open Your Terminal
- Windows: Search for “Anaconda Prompt”
- Mac: Open Terminal app
- Linux: Press Ctrl+Alt+T
Step 3: Download OmniParser
git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
Step 4: Set Up a Safe Python Environment
conda create -n omni python=3.12 -y
conda activate omni
Step 5: Install Required Packages
pip install -r requirements.txt
Step 6: Set Up Hugging Face for Model Access
- Create a free account at HuggingFace.co
- In terminal:
huggingface-cli login
- Paste your Hugging Face access token
- Then run:
rm -rf weights/icon_detect weights/icon_caption weights/icon_caption_florence
huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights
mv weights/icon_caption weights/icon_caption_florence
Step 7: Launch the Demo!
python gradio_demo.py
Your browser will open a local app where you can upload a screenshot. The AI will describe what it sees: buttons, fields, menus, etc.
What’s Happening Under the Hood?
- Detection Module (YOLOv8): Finds clickable items on your screen
- Captioning Module (Florence-2): Explains what those items do
The result? You get an LLM that can see and act, like a mini AI intern.
What You Can Build Next
- Connect it with GPT-4o, Sonnet, or Claude
- Try OmniTool — a full system for interacting with apps
- Automate tasks like:
- Opening email
- Clicking buttons in forms
- Filling out spreadsheets or websites
- Use tools like LangChain, LM Studio, or AutoGen
Bonus: Connect OmniParser with GPT-4o
Now that OmniParser can “see” your screen, you’ll want an AI that can make decisions and give it commands, that’s where GPT-4o comes in.
Option 1: Use GPT-4o in LM Studio (Locally)
- Download and install LM Studio
- Import your GPT-4o (or compatible) model
- Copy your local endpoint (like
http://localhost:1234/v1/chat/completions
) - In OmniParser, update the Python script to send text prompts to this endpoint
Sample Use Case
OmniParser reads the screen and sends: “There is a button labeled ‘Submit Invoice'”. GPT-4o responds with: “Click the ‘Submit Invoice’ button.”
You can then pass this response to a click executor function, turning GPT into a hands-on assistant.
Trust & Safety Notice
OmniParser is an open-source project maintained by Microsoft Research and available on GitHub. Always review the code and understand what you’re running, especially when downloading third-party models.
If you’re using screenshots from private apps or sites, avoid sharing them publicly.
Final Thoughts
You’ve just built your first computer-using AI assistant, without writing a single line of code. OmniParser V2 unlocks the next phase of AI: not just thinking, but doing.
The future of AI isn’t passive. It’s active. And now, you’re part of it.