DevOps Swarm AI
Three agent architectures for PR review, code quality, CI diagnosis, and automated reporting
I built a DevOps review assistant on top of Google Cloud's agent-starter-pack scaffold and implemented the same four capabilities through three orchestration patterns: a CrewAI crew, a LangGraph pipeline, and a ReAct-style tool-calling agent. I connected the tool-calling version to FastAPI, added structured failure handling, traced requests through Traceloop and Google Cloud Trace, and added a GitLab CI pipeline for the test suite.
Google Cloud's agent-starter-pack scaffold (v0.4.2) provided the base FastAPI server, Terraform deployment configuration, Streamlit playground, Cloud Build configuration, observability skeleton, and project layout. I implemented the CrewAI crew, the LangGraph sequential pipeline, the ReAct tool-calling agent, the four DevOps tools, the FastAPI execute endpoint, and the GitLab CI pipeline on top of it.
Overview
I built a DevOps review assistant on Google Cloud's agent-starter-pack scaffold. Its four capabilities (PR analysis, code quality review, CI log diagnosis, and report generation) each call Gemini 2.5 Pro through Vertex AI with a dedicated prompt. I implemented them three ways: as a CrewAI crew with explicit task dependencies, in which the PR, quality, and CI agents feed a final report agent; as a LangGraph pipeline that runs the same four steps in a fixed sequence; and as a ReAct-style agent that exposes the four capabilities as callable tools and lets the model decide which to call and when to stop. The ReAct agent is the one wired into the FastAPI backend, behind a dedicated execute endpoint that accepts a PR title, diff, CI logs, and merge request ID.
Problem
Reviewing a pull request well means reading the diff, checking code quality, and tracing CI failures back to a cause: three distinct review tasks that a single general-purpose assistant tends to blend together. Purpose-built agents, each focused on one part of that review, can handle each task more precisely than one assistant trying to do everything at once.
Intended User
Built as an internal developer-tooling experiment to automate parts of CI/CD review work: summarizing pull requests, flagging code quality issues, and diagnosing CI failures.
Architecture
Google Cloud's agent-starter-pack scaffold provides a FastAPI server, Terraform-based GCP deployment configuration, a Streamlit playground, Cloud Build CI/CD configs, and OpenTelemetry observability wiring. On top of that, I built three agent architectures around the same four capabilities.
A CrewAI crew (app/crew/crew.py) defines four agents (a Pull Request Analyzer, a Code Quality Inspector, a CI Pipeline Expert, and a Report Generator) and their four tasks, linked by explicit context dependencies and run through a sequential process. A LangGraph pipeline (multi_agent_graph.py) chains four equivalent Gemini-backed node functions in a fixed order, from PR analysis through to a final report. A ReAct-style agent (app/agent.py) binds the same four capabilities to the model as LangChain tools (tools/pr_analyzer_tool.py, quality_check_tool.py, ci_health_check_tool.py, report_tool.py) and routes through a conditional agent-to-tool edge, so the model decides which tool to call and when the task is complete.
The ReAct-style agent is the one wired into the FastAPI server's stream_messages and execute routes. Every tool wraps its Gemini call in error handling, returning a structured error message rather than failing the loop, and every request is traced through Traceloop into Google Cloud Trace.
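The agent-to-tool routing described above can be illustrated without any framework. This is a minimal, library-free sketch, not the code in app/agent.py: the tool names, the request keys, and the stub_model function (a stand-in for the model's tool-calling decision) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical stand-ins for the four Gemini-backed tools.
TOOLS: Dict[str, Callable[[dict], str]] = {
    "pr_analyzer": lambda req: f"PR summary for: {req['title']}",
    "quality_check": lambda req: "quality notes on the diff",
    "ci_health_check": lambda req: "CI diagnosis from the logs",
    "report": lambda req: "final report",
}


@dataclass
class AgentState:
    request: dict
    calls: List[str] = field(default_factory=list)    # tools invoked so far
    outputs: List[str] = field(default_factory=list)  # their results, in order


def stub_model(state: AgentState) -> Optional[str]:
    """Stand-in for the model: return the next tool name, or None to stop."""
    wanted = ["pr_analyzer", "quality_check"]
    if state.request.get("ci_logs"):       # only ask for CI diagnosis if logs exist
        wanted.append("ci_health_check")
    wanted.append("report")
    for name in wanted:
        if name not in state.calls:
            return name
    return None


def run_agent(request: dict, max_steps: int = 8) -> AgentState:
    state = AgentState(request=request)
    for _ in range(max_steps):             # hard cap so the loop always terminates
        choice = stub_model(state)         # "agent" step
        if choice is None:                 # conditional edge: no tool call, so finish
            break
        state.calls.append(choice)         # "tools" step
        state.outputs.append(TOOLS[choice](state.request))
    return state
```

Calling run_agent with an empty ci_logs value skips the CI tool entirely, which is the behavior a fixed sequence cannot express. The max_steps cap is an assumption added here to guard against a model that never stops calling tools.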
My Contribution
I implemented all three agent architectures (the CrewAI crew, the LangGraph pipeline, and the ReAct tool-calling agent), the four underlying DevOps tools, the FastAPI execute endpoint, and a GitLab CI pipeline that runs the unit and integration test suites on every push. Google Cloud's agent-starter-pack scaffold (v0.4.2) provided the base FastAPI server, Terraform deployment configuration, Streamlit playground, Cloud Build configuration, observability skeleton, and project layout that all of this is built on.
Implementation
- Implemented a CrewAI crew with four domain-specific agents and four tasks, wiring explicit context dependencies so the quality, CI, and report tasks each receive the prior tasks' output.
- Implemented a LangGraph pipeline that chains the same four capabilities, PR analysis, code quality, CI diagnosis, and reporting, in a fixed sequence.
- Implemented a ReAct-style agent that exposes the four capabilities as callable tools and lets the model decide which to call and when to stop, rather than following a fixed sequence.
- Wrapped every tool call in error handling so a failed Gemini request returns a structured error message instead of crashing the agent loop.
- Added a dedicated execute endpoint to the FastAPI server that accepts a PR title, diff, CI logs, and merge request ID, and routes them into the tool-calling agent.
- Added a GitLab CI pipeline that installs dependencies and runs the unit and integration test suites on every push.
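The context dependencies from the first item in this list can be sketched in plain Python. This is an illustration of the idea only, not CrewAI's API: the Task class, run_sequential, the task names, and the exact dependency edges (quality reads the PR analysis, CI reads both, the report reads all three) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    run: Callable[[dict, List[str]], str]             # (inputs, upstream outputs) -> output
    context: List[str] = field(default_factory=list)  # names of upstream tasks


def run_sequential(tasks: List[Task], inputs: dict) -> Dict[str, str]:
    """Run tasks in list order, handing each one its declared upstream outputs."""
    outputs: Dict[str, str] = {}
    for task in tasks:
        missing = [n for n in task.context if n not in outputs]
        if missing:
            raise ValueError(f"{task.name} depends on tasks not yet run: {missing}")
        upstream = [outputs[n] for n in task.context]
        outputs[task.name] = task.run(inputs, upstream)
    return outputs


def make_stub(label: str) -> Callable[[dict, List[str]], str]:
    # Stand-in for an agent call; it only reports how much context it received.
    def run(inputs: dict, upstream: List[str]) -> str:
        return f"{label}(context={len(upstream)})"
    return run


# Assumed dependency edges, for illustration only.
TASKS = [
    Task("pr_analysis", make_stub("pr")),
    Task("quality", make_stub("quality"), context=["pr_analysis"]),
    Task("ci", make_stub("ci"), context=["pr_analysis", "quality"]),
    Task("report", make_stub("report"), context=["pr_analysis", "quality", "ci"]),
]
```

Running run_sequential(TASKS, {}) executes all four tasks in order. Declaring a dependency on a task that appears later in the list raises a ValueError instead of silently passing empty context.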
Key Decisions
Four task-specific agents over one general-purpose assistant
Why — PR review, code quality, CI diagnosis, and reporting were different enough tasks that dedicated agents and prompts stayed focused rather than one agent trying to do everything.
Building the same capability three ways: a CrewAI crew, a LangGraph pipeline, and a ReAct tool-calling agent
Why — A fixed pipeline gives predictable, repeatable ordering for the same four steps every time. A tool-calling agent lets the model decide which steps a given request actually needs. Building both, plus a CrewAI version with explicit task dependencies, let me compare a deterministic workflow against a model-driven one on the same underlying logic before choosing the tool-calling agent as the one wired into the live server.
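The deterministic side of that comparison can be shown in a few lines. This is a framework-free sketch in the style of a state graph, where each node returns a partial update that is merged into shared state. The node bodies and state keys are hypothetical placeholders for the Gemini-backed functions.

```python
from typing import Callable, Dict, List

State = Dict[str, str]


def pr_node(state: State) -> State:
    return {"pr_summary": f"summary of {len(state['diff'])} diff chars"}


def quality_node(state: State) -> State:
    return {"quality_notes": f"notes informed by: {state['pr_summary']}"}


def ci_node(state: State) -> State:
    # This node runs on every request, even when there is nothing to diagnose.
    logs = state.get("ci_logs", "")
    return {"ci_diagnosis": "diagnosis of CI logs" if logs else "no CI logs provided"}


def report_node(state: State) -> State:
    parts = [state["pr_summary"], state["quality_notes"], state["ci_diagnosis"]]
    return {"report": " | ".join(parts)}


# Fixed order: every request passes through all four nodes.
PIPELINE: List[Callable[[State], State]] = [pr_node, quality_node, ci_node, report_node]


def run_pipeline(initial: State) -> State:
    state = dict(initial)          # copy so the caller's dict is not mutated
    for node in PIPELINE:
        state.update(node(state))  # merge each node's partial update
    return state
```

The order is fixed in PIPELINE, so the same input always produces the same sequence of steps. The cost is visible in ci_node, which still executes when ci_logs is empty.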
Building on Google Cloud's agent-starter-pack scaffold rather than a bare setup
Why — The scaffold already provided a FastAPI server, Terraform deployment configuration, and observability wiring, so I could spend my own time on the agents, tools, and orchestration logic instead of rebuilding infrastructure.
A dedicated GitLab CI pipeline on top of the scaffold's own Cloud Build configuration
Why — I wanted the unit and integration test suites to run automatically on every push rather than relying on manual test runs.
Testing & Validation
Integration tests exercise the live ReAct agent and its streaming behavior end to end. A separate test launches the FastAPI server as a subprocess and calls the real HTTP routes, validating the application beyond isolated helper functions.
Results
I implemented four DevOps review capabilities through three working agent architectures. The CrewAI crew models explicit task dependencies, the LangGraph pipeline provides predictable sequential execution, and the ReAct agent selects tools dynamically based on the request. I wired the tool-calling version into FastAPI, added structured error handling around every Gemini call, and traced each request through Traceloop and Google Cloud Trace.
Reliability & Failure Handling
Every tool wraps its Gemini call in error handling, returning a structured error message and message list instead of raising, so a failed model call surfaces as a normal response rather than crashing the active agent loop. Every request is traced through Traceloop, tagged with run, user, session, and commit metadata, and exported to Google Cloud Trace.
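One way to express this wrapping is a decorator that converts any exception into a structured result. This is a sketch under assumptions, not the project's tool code: the safe_tool name, the result shape (ok, tool, output, error), and the fake_model_call stub are hypothetical.

```python
import functools
from typing import Any, Callable, Dict


def safe_tool(tool_name: str) -> Callable[[Callable[..., str]], Callable[..., Dict[str, Any]]]:
    """Wrap a tool so failures come back as data instead of exceptions."""
    def decorator(fn: Callable[..., str]) -> Callable[..., Dict[str, Any]]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
            try:
                return {"ok": True, "tool": tool_name, "output": fn(*args, **kwargs)}
            except Exception as exc:  # broad on purpose: the agent loop must keep running
                return {
                    "ok": False,
                    "tool": tool_name,
                    "error": f"{type(exc).__name__}: {exc}",
                }
        return wrapper
    return decorator


def fake_model_call(prompt: str) -> str:
    # Stand-in for the Gemini request; fails when there is nothing to analyze.
    if not prompt.strip():
        raise ValueError("empty CI logs")
    return f"diagnosis for {len(prompt)} chars of logs"


@safe_tool("ci_health_check")
def diagnose_ci(logs: str) -> str:
    return fake_model_call(logs)
```

With this shape, diagnose_ci("") returns a dictionary with ok set to False and an error string, so the caller can pass the failure back to the model as an ordinary tool result.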
Deployment & Runtime
The project includes Terraform configuration for GCP, Cloud Build deployment workflows from the scaffold, and a GitLab CI pipeline I added to run the test suite on every push. The deployment path was exercised during the project period, but no public live instance is maintained today.
Lessons Learned
- Building the same four capabilities three different ways made the trade-off between orchestration styles concrete. The CrewAI crew and the LangGraph pipeline both run every step in the same fixed order every time, which makes their behavior predictable and easy to trace, but neither can skip a step a given request does not need. The ReAct-style tool-calling agent decides which tools to call based on the input, so a request with no CI logs is not forced through a CI diagnosis step, but its behavior depends on the model choosing correctly rather than on a fixed graph.
- Wrapping every tool call in structured error handling mattered across all three versions, since a failed Gemini call needed to surface as a normal message rather than break the orchestration loop.