
Implementation Guide: Conduct full due diligence review of a document package and produce a risk report
Step-by-step implementation guide for deploying AI to conduct full due diligence review of a document package and produce a risk report for Legal Services clients.
Hardware Procurement
Attorney Workstation
$1,350 per unit MSP cost / $1,750 suggested resale
Primary attorney workstation for accessing AI due diligence tools, reviewing documents side-by-side with AI risk analysis, and approving human-in-the-loop checkpoints. 16GB RAM minimum required for smooth operation of Microsoft 365 Copilot, Spellbook Word add-in, and browser-based AI dashboards simultaneously.
Dual Monitor Setup
$280 per unit MSP cost / $370 suggested resale
Two monitors per attorney workstation. Left monitor displays source documents from VDR/DMS, right monitor displays AI review pane and risk report. QHD resolution essential for reading dense legal documents without excessive scrolling.
Document Scanner
$550 per unit MSP cost / $750 suggested resale
High-speed duplex document scanner for digitizing paper documents that arrive as part of due diligence packages. 70 ppm scan speed handles large document sets efficiently. TWAIN/ISIS drivers integrate with Adobe Acrobat and DMS ingest workflows. Two units provide redundancy and serve separate office areas.
Network Switch Upgrade
UniFi USW-Pro-24-PoE
$400 MSP cost / $550 suggested resale
Ensures sufficient bandwidth for uploading large document packages (100+ MB per transaction) to cloud AI services and VDRs. PoE supports VoIP phones and access points without separate power infrastructure. Replaces aging consumer-grade switches common in SMB law firms.
Wireless Access Point
$150 per unit MSP cost / $220 suggested resale
Provides reliable Wi-Fi 6 coverage for attorneys working from conference rooms during deal closings. Ensures stable connectivity to cloud AI services from anywhere in the office.
Software Procurement
Microsoft 365 E3
$36/user/month x 10 users = $360/month MSP cost / $450/month suggested resale
Foundation platform providing Exchange Online, SharePoint Online, Teams, Word, and Azure AD/Entra ID for SSO. SharePoint serves as interim document staging area for AI ingest. Word is the primary contract review surface for Spellbook integration. Azure AD provides identity and access management for all AI tools.
Microsoft 365 Copilot
$30/user/month x 10 users = $300/month MSP cost / $400/month suggested resale
AI assistant embedded in Word, Outlook, Teams, and SharePoint. Used for summarizing email threads related to transactions, drafting correspondence about due diligence findings, searching across SharePoint document libraries, and generating meeting notes from deal team calls. Complements but does not replace the specialized legal AI tools.
Clio Manage (Complete Plan)
$139/user/month x 10 users = $1,390/month MSP cost / $1,750/month suggested resale
Practice management system providing matter management, time tracking, client communication, and billing. Manage AI (formerly Clio Duo) provides built-in AI capabilities for summarizing matter timelines, drafting client communications, and searching firm knowledge. Critical for logging AI-assisted work hours and generating LEDES billing exports. All due diligence matters are tracked here.
Spellbook
$100–$179/user/month = $500–$895/month MSP cost / $700–$1,100/month suggested resale
Word-native AI contract review tool that provides clause-level analysis, risk flagging, missing clause detection, and suggested language. Licensed for the 5 attorneys who most frequently handle due diligence matters. Integrates directly into Microsoft Word ribbon for seamless workflow. Uses GPT-5, Claude, and other leading LLMs. Lowest barrier to entry for SMB firms.
Azure OpenAI Service (GPT-5.4)
$2.50/million input tokens + $10.00/million output tokens; estimated $500–$1,500/month based on 3–5 DD transactions/month / suggested resale with 25% markup = $625–$1,875/month
Core LLM API powering the custom due diligence orchestration agent. GPT-5.4 provides 128K context window for processing large contract sections. Azure deployment ensures data residency within chosen region, BAA availability for HIPAA-adjacent matters, and compliance with firm security policies. All API calls route through Azure Virtual Network for network isolation.
Azure AI Document Intelligence
$1.50–$15.00 per 1,000 pages; estimated $50–$200/month / suggested resale $75–$275/month
OCR and document parsing service that converts scanned PDFs, images, and complex multi-column documents into structured text. Extracts tables, key-value pairs, and form fields. Critical for processing paper documents and poorly formatted PDFs common in due diligence packages. Pre-built models for invoices, receipts, and contracts accelerate processing.
Pinecone Vector Database
Free tier for development; Standard plan $70–$200/month for production / suggested resale $100–$275/month
Stores vector embeddings of all documents in the due diligence package, enabling semantic search across the entire corpus. The custom DD agent queries Pinecone to find related clauses across different contracts, identify contradictions between documents, and retrieve relevant precedent from the firm's historical DD reports. Serverless deployment eliminates infrastructure management.
iManage Work 10 Cloud
$39–$50/user/month x 10 users = $390–$500/month MSP cost / $500–$650/month suggested resale
Legal-grade document management system serving as the system of record for all due diligence documents, work product, and final reports. Provides ethical walls, matter-centric organization, version control, and audit trails required for legal compliance. AI agent reads from and writes to iManage via REST API. If the client already has NetDocuments, substitute accordingly.
Adobe Acrobat Pro
$23/user/month x 10 users = $230/month MSP cost / $300/month suggested resale
PDF manipulation, OCR, redaction, and Bates stamping for due diligence documents. Attorneys use it to review AI-flagged sections in context, apply redactions before sharing, and prepare final DD report PDFs. Built-in OCR supplements Azure AI Document Intelligence for complex layouts.
Veeam Backup for Microsoft 365
$2–$4/user/month x 10 users = $20–$40/month MSP cost / $50–$80/month suggested resale
Backs up all SharePoint, OneDrive, Exchange, and Teams data including AI-generated reports and client communications stored in Microsoft 365. Legal hold and granular restore capabilities support litigation readiness and data retention obligations.
Prerequisites
- Stable internet connection with minimum 100 Mbps symmetric bandwidth and less than 50ms latency to Azure East US or nearest region
- Active Microsoft 365 E3 or E5 tenant with Azure AD/Entra ID configured for all attorney and staff accounts
- Azure subscription with billing configured and OpenAI Service access approved (may require application at https://aka.ms/oai/access)
- Active iManage Work 10 Cloud or NetDocuments subscription with API access enabled and matter structure configured
- Clio Manage account provisioned with Complete plan and Manage AI enabled for all users
- Client engagement letters updated to include AI disclosure language per ABA Formal Opinion 512 — client must have informed consent for AI use on their matters
- Firm AI usage policy drafted and signed by all attorneys covering: permitted uses, confidentiality obligations, supervision requirements, and prohibited uses per ABA Model Rules 1.1, 1.6, and 5.3
- Data Processing Agreement (DPA) executed with every AI vendor (Azure, Spellbook, Pinecone) specifying: no training on firm data, data residency requirements, breach notification within 72 hours, and right to deletion
- Python 3.11+ runtime environment on the MSP's deployment workstation or Azure VM for custom agent deployment
- Domain name or subdomain for the DD report portal (e.g., dd-reports.firmname.com) with DNS access
- SSL/TLS certificates for any custom-hosted endpoints
- Administrative access to firm's firewall/router for allowlisting Azure and SaaS endpoints
- Fujitsu PaperStream drivers installed on scanning workstations
- All attorneys who will use the system have completed a baseline AI ethics training (provided by MSP in Phase 4)
Installation Steps
Step 1: Azure Environment Setup and OpenAI Service Provisioning
Create the Azure resource group, configure networking, and provision the Azure OpenAI Service instance that will power the custom due diligence agent. This establishes the AI compute foundation with proper security boundaries.
az login
az group create --name rg-legal-dd-agent --location eastus
az network vnet create --resource-group rg-legal-dd-agent --name vnet-dd-agent --address-prefix 10.0.0.0/16 --subnet-name subnet-ai --subnet-prefix 10.0.1.0/24
az cognitiveservices account create --name oai-legal-dd --resource-group rg-legal-dd-agent --kind OpenAI --sku S0 --location eastus --custom-domain oai-legal-dd
az cognitiveservices account deployment create --name oai-legal-dd --resource-group rg-legal-dd-agent --deployment-name gpt-5.4-dd --model-name gpt-5.4 --model-version 2024-08-06 --model-format OpenAI --sku-capacity 80 --sku-name Standard
az cognitiveservices account deployment create --name oai-legal-dd --resource-group rg-legal-dd-agent --deployment-name text-embedding-3-large --model-name text-embedding-3-large --model-version 1 --model-format OpenAI --sku-capacity 120 --sku-name Standard
az cognitiveservices account keys list --name oai-legal-dd --resource-group rg-legal-dd-agent
Request Azure OpenAI access in advance — approval can take 1–5 business days. The SKU capacity of 80K tokens per minute for GPT-5.4 should handle 3–5 concurrent DD transactions. Increase capacity if the firm processes more than 5 deals simultaneously. Store the API key securely in Azure Key Vault (configured in Step 3). Choose the Azure region closest to the firm's office for lowest latency.
Step 2: Azure AI Document Intelligence Provisioning
Deploy the document parsing service that converts scanned PDFs and images into structured text for the AI agent to process. This is critical because due diligence packages frequently contain scanned documents, faxes, and poorly formatted legacy files.
az cognitiveservices account create --name doc-intel-legal-dd --resource-group rg-legal-dd-agent --kind FormRecognizer --sku S0 --location eastus
az cognitiveservices account keys list --name doc-intel-legal-dd --resource-group rg-legal-dd-agent
The S0 tier supports up to 15 concurrent requests. For large DD packages (500+ documents), consider queuing document processing to stay within rate limits. Azure AI Document Intelligence replaces the older Form Recognizer branding but uses the same API endpoints.
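One way to honor that concurrency ceiling is to submit documents in fixed-size batches, waiting for each batch to finish before starting the next. This is a minimal sketch, not Azure SDK code; the `batches` helper and the 12-request headroom figure are our own assumptions, not values from Microsoft documentation.

```python
from typing import Iterable, List, TypeVar

T = TypeVar('T')

# S0 allows roughly 15 concurrent analyze requests; leave some headroom.
MAX_CONCURRENT = 12

def batches(items: List[T], size: int = MAX_CONCURRENT) -> Iterable[List[T]]:
    """Yield fixed-size batches so a large DD package never exceeds the tier's concurrency."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Each batch would then be submitted together (e.g., via concurrent.futures),
# collecting all pollers' results before the next batch begins.
```

A 500-document package would thus be processed as ~42 sequential batches, keeping the service well under its rate limits.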
Step 3: Azure Key Vault and Security Configuration
Create a centralized secret store for all API keys, connection strings, and certificates. Configure managed identities so the DD agent can access services without embedding credentials in code. This is essential for legal compliance and audit readiness.
az keyvault create --name kv-legal-dd --resource-group rg-legal-dd-agent --location eastus --enable-rbac-authorization true
az keyvault secret set --vault-name kv-legal-dd --name AzureOpenAIKey --value <YOUR-OPENAI-KEY>
az keyvault secret set --vault-name kv-legal-dd --name DocIntelKey --value <YOUR-DOC-INTEL-KEY>
az keyvault secret set --vault-name kv-legal-dd --name PineconeApiKey --value <YOUR-PINECONE-KEY>
az keyvault secret set --vault-name kv-legal-dd --name iManageClientSecret --value <YOUR-IMANAGE-SECRET>
az webapp identity assign --resource-group rg-legal-dd-agent --name app-dd-agent
az role assignment create --role 'Key Vault Secrets User' --assignee <MANAGED-IDENTITY-OID> --scope $(az keyvault show --name kv-legal-dd --query id -o tsv)
Because the vault was created with --enable-rbac-authorization, grant access with an RBAC role assignment; az keyvault set-policy applies only to vaults in access-policy mode. Note also that app-dd-agent is not created until Step 7, so run these two commands after that step. Never store API keys in source code or environment variables on developer machines. All secrets must be retrieved at runtime from Key Vault. Enable Azure Key Vault logging to capture all access events for compliance audits. Rotate all keys on a 90-day schedule.
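The "retrieve at runtime" rule can be honored with a thin in-memory cache around the Key Vault client. A sketch for illustration: `SecretCache` is our own name, not an SDK type, and the client is injected so the pattern can be exercised without Azure (in production it would be an `azure.keyvault.secrets.SecretClient` authenticated via `DefaultAzureCredential`).

```python
class SecretCache:
    """Fetch secrets at runtime and cache them in memory only (never on disk).

    `client` is anything exposing get_secret(name) -> object with a .value
    attribute, matching the shape of azure.keyvault.secrets.SecretClient.
    """
    def __init__(self, client):
        self._client = client
        self._cache = {}

    def get(self, name: str) -> str:
        # First access hits Key Vault; later accesses are served from memory.
        if name not in self._cache:
            self._cache[name] = self._client.get_secret(name).value
        return self._cache[name]
```

Caching keeps Key Vault request volume (and its audit log) proportional to process restarts rather than to every API call the agent makes.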
Step 4: Pinecone Vector Database Setup
Create the Pinecone index that stores document embeddings for semantic search across the due diligence corpus. The DD agent uses this to find related clauses across different documents, detect contradictions, and retrieve relevant precedent.
pip install pinecone
python3 -c "
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key='YOUR_PINECONE_API_KEY')
pc.create_index(
    name='legal-dd-corpus',
    dimension=3072,
    metric='cosine',
    spec=ServerlessSpec(cloud='aws', region='us-east-1')
)
print('Index created successfully')
print(pc.describe_index('legal-dd-corpus'))
"
Dimension 3072 matches the text-embedding-3-large model output. Use cosine similarity for best results with normalized legal text embeddings. The serverless spec on AWS us-east-1 ensures low latency from Azure East US. Each DD matter should use a separate namespace within the index for data isolation: namespace format is 'matter-{CLIO_MATTER_ID}'. Note that the PyPI package is now named pinecone; the older pinecone-client package is deprecated.
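The per-matter namespace convention can be centralized in two small helpers so no query ever leaks across matters. A sketch: `matter_namespace` and `related_clause_query` are hypothetical helpers of our own, and the returned keyword arguments mirror the shape of Pinecone's `index.query()` call.

```python
def matter_namespace(clio_matter_id: str) -> str:
    """Namespace format from the note above: one namespace per DD matter."""
    return f'matter-{clio_matter_id}'

def related_clause_query(clio_matter_id: str, embedding: list,
                         document_class: str, top_k: int = 5) -> dict:
    """Build kwargs for index.query() so every search stays inside one
    matter's namespace and can be narrowed by document class metadata."""
    return {
        'namespace': matter_namespace(clio_matter_id),
        'vector': embedding,
        'top_k': top_k,
        'filter': {'document_class': {'$eq': document_class}},
        'include_metadata': True,
    }
```

Routing every search through one builder like this makes the matter-isolation guarantee auditable in a single place.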
Step 5: iManage API Integration Setup
Configure the iManage REST API connection that allows the DD agent to read documents from and write reports to the firm's document management system. This ensures all work product flows through the DMS for proper matter-centric organization and audit trails.
curl -X POST https://cloudimanage.com/auth/oauth2/token \
-H 'Content-Type: application/x-www-form-urlencoded' \
-d 'grant_type=client_credentials&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET&scope=imanage.document.read%20imanage.document.write'
curl -X GET 'https://cloudimanage.com/work/api/v2/customers/YOUR_CUSTOMER_ID/documents?limit=5' \
-H 'Authorization: Bearer YOUR_ACCESS_TOKEN'
iManage API access must be requested through the firm's iManage administrator or iManage support. Ensure the integration application is assigned to a service account with appropriate security group membership — do NOT use an individual attorney's credentials. If the firm uses NetDocuments instead of iManage, substitute the NetDocuments REST API (https://api.vault.netvoyage.com) with equivalent OAuth2 configuration.
Step 6: Clio API Integration Setup
Configure the Clio API connection for matter data synchronization, time entry logging, and client/matter validation. The DD agent reads matter metadata from Clio to scope its analysis and writes time entries for AI-assisted review hours.
curl -X POST https://app.clio.com/oauth/token \
-H 'Content-Type: application/x-www-form-urlencoded' \
-d 'grant_type=authorization_code&code=AUTH_CODE&redirect_uri=https://dd-reports.firmname.com/clio-callback&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET'
curl -X GET 'https://app.clio.com/api/v4/matters?fields=id,display_number,description,client' \
-H 'Authorization: Bearer YOUR_ACCESS_TOKEN'
Clio API rate limit is 600 requests per minute per application. The DD agent should batch API calls and implement exponential backoff. Store the refresh token securely in Key Vault — Clio access tokens expire after 1 hour. Ensure the Clio admin has enabled API access for the firm's subscription tier.
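The exponential-backoff behavior can be sketched as a small retry wrapper. This is illustrative only: `RateLimited` is a hypothetical exception the agent's HTTP layer would raise on an HTTP 429 from Clio, and `with_backoff` is not part of any Clio SDK.

```python
import random
import time

class RateLimited(Exception):
    """Raised by the caller's HTTP layer on a 429 (rate limit) response."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Retry `call` on rate-limit errors with exponential backoff plus jitter.

    Delays grow as base_delay * 2^attempt (1s, 2s, 4s, ...) with up to 0.5s
    of random jitter to avoid synchronized retries across workers.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # budget exhausted; surface the error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

The `sleep` parameter is injectable so the wrapper can be unit-tested without real delays.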
Step 7: Deploy Custom DD Agent Application to Azure App Service
Deploy the Python-based due diligence orchestration agent as an Azure Web App. This application coordinates document ingestion, LLM analysis, vector search, and report generation. It exposes a REST API that attorneys trigger from the firm's internal portal.
az appservice plan create --name plan-dd-agent --resource-group rg-legal-dd-agent --sku B2 --is-linux
az webapp create --resource-group rg-legal-dd-agent --plan plan-dd-agent --name app-dd-agent --runtime 'PYTHON:3.11'
az webapp config appsettings set --resource-group rg-legal-dd-agent --name app-dd-agent --settings \
AZURE_OPENAI_ENDPOINT=https://oai-legal-dd.openai.azure.com/ \
AZURE_OPENAI_DEPLOYMENT=gpt-5.4-dd \
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-3-large \
DOC_INTEL_ENDPOINT=https://doc-intel-legal-dd.cognitiveservices.azure.com/ \
PINECONE_INDEX=legal-dd-corpus \
PINECONE_ENVIRONMENT=us-east-1 \
KEY_VAULT_URI=https://kv-legal-dd.vault.azure.net/ \
CLIO_CLIENT_ID=YOUR_CLIO_CLIENT_ID \
IMANAGE_CUSTOMER_ID=YOUR_IMANAGE_CUSTOMER_ID \
FLASK_ENV=production
git clone https://github.com/your-msp/legal-dd-agent.git
cd legal-dd-agent
zip -r ../legal-dd-agent.zip .
az webapp deploy --resource-group rg-legal-dd-agent --name app-dd-agent --src-path ../legal-dd-agent.zip --type zip
az webapp config hostname add --resource-group rg-legal-dd-agent --webapp-name app-dd-agent --hostname dd-reports.firmname.com
az webapp config ssl bind --resource-group rg-legal-dd-agent --name app-dd-agent --certificate-thumbprint YOUR_CERT_THUMBPRINT --ssl-type SNI
The B2 App Service plan (2 vCPU, 3.5 GB RAM) handles typical DD workloads. Scale to P1V3 if processing more than 5 concurrent transactions. Enable Always On to prevent cold starts. Configure deployment slots for zero-downtime updates. Note that az webapp deploy --type zip expects a zip archive, not a directory, hence the zip step above. All environment variables reference Key Vault — sensitive values like API keys are retrieved at runtime via managed identity, not stored in app settings.
Step 8: Microsoft 365 and Copilot Configuration
Configure Microsoft 365 E3 licenses, enable Copilot add-ons, configure SharePoint document libraries for DD staging, and set up Teams channels for deal team collaboration with AI notifications.
Connect-MgGraph -Scopes 'User.ReadWrite.All','Organization.Read.All'
$E3Sku = Get-MgSubscribedSku | Where-Object { $_.SkuPartNumber -eq 'SPE_E3' }
$CopilotSku = Get-MgSubscribedSku | Where-Object { $_.SkuPartNumber -eq 'Microsoft_365_Copilot' }
$users = Get-MgUser -Filter "department eq 'Legal'" -All
foreach ($user in $users) {
    Set-MgUserLicense -UserId $user.Id -AddLicenses @(@{SkuId=$E3Sku.SkuId},@{SkuId=$CopilotSku.SkuId}) -RemoveLicenses @()
    Write-Host "Licensed: $($user.DisplayName)"
}
Copilot requires a qualifying Microsoft 365 base license; this deployment pairs it with E3. (Microsoft originally restricted Copilot to E3/E5 but has since extended eligibility to Business Standard and Premium.) Some firms may prefer E5 for advanced compliance features (eDiscovery Premium, Information Barriers). The SharePoint DD staging site should have sensitivity labels applied for 'Highly Confidential' classification. Configure retention policies to preserve DD documents for 7 years per typical legal retention requirements.
Step 9: Spellbook Installation and Configuration
Deploy Spellbook's Microsoft Word add-in for the 5 power-user attorneys who handle most DD work. Configure firm-specific clause libraries and review templates.
- Configure Spellbook organization settings at https://app.spellbook.legal/settings
- Enable 'Review Mode' as default for DD workflows
- Upload firm clause library (standard acceptable clauses)
- Configure risk sensitivity levels: High/Medium/Low
- Set data handling to 'Do not train on our data'
- Enable audit logging for all AI interactions
Spellbook works as a Word ribbon extension — no separate application to install. Ensure Word is updated to the latest version (at least build 16.0.17000 or later) for full add-in compatibility. Only license 5 power users initially and expand based on adoption metrics. Spellbook's review mode flags non-standard clauses, missing provisions, and potential risks — complementing the custom DD agent's cross-document analysis.
Step 10: Document Scanner Configuration and OCR Pipeline
Install and configure the Fujitsu fi-8170 scanners with PaperStream software, create scan profiles for due diligence documents, and connect the scanning workflow to the Azure AI Document Intelligence OCR pipeline.
300 DPI provides the best balance of OCR accuracy and file size. Do not scan at 600 DPI unless documents contain very small print or complex diagrams. PDF/A format ensures long-term archival compliance. Train firm staff on proper document preparation: remove staples, orient pages correctly, separate double-sided documents. The OCR pipeline in the custom agent automatically processes these scanned PDFs.
Step 11: Firewall and Network Security Configuration
Configure the firm's firewall to allowlist all necessary cloud service endpoints while maintaining security. Set up conditional access policies in Azure AD to restrict AI tool access to managed devices and approved locations.
Do not use broad wildcard rules — be specific about which endpoints are allowed. Enable TLS inspection for non-privileged traffic but EXCLUDE attorney-client communications from TLS inspection to avoid privilege waiver issues. Document all firewall rules in the firm's security policy. Review conditional access policies with the managing partner to ensure they do not impede legitimate remote work during deal closings.
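The "no broad wildcards" rule can be enforced with a small audit script that checks proposed destinations against the exact hostnames provisioned earlier. A sketch under stated assumptions: this is an audit helper, not firewall configuration, and the hostnames mirror the resource names used in Steps 1–6 rather than vendor-published endpoint lists.

```python
ALLOWED_HOSTS = {
    # Endpoints from this deployment; extend per each vendor's documentation.
    'oai-legal-dd.openai.azure.com',
    'doc-intel-legal-dd.cognitiveservices.azure.com',
    'kv-legal-dd.vault.azure.net',
    'cloudimanage.com',
    'app.clio.com',
    'app.spellbook.legal',
    'dd-reports.firmname.com',
}

def is_allowed(hostname: str) -> bool:
    """Exact-match allowlisting: no '*.azure.com'-style wildcard rules."""
    return hostname.lower() in ALLOWED_HOSTS
```

Running this check against exported firewall rules during quarterly reviews makes it easy to spot rules that have drifted toward wildcards.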
Step 12: Veeam Backup Configuration
Configure Veeam Backup for Microsoft 365 to protect all SharePoint DD staging content, OneDrive attorney files, Exchange mailboxes with deal communications, and Teams deal room conversations.
7-year retention aligns with most state bar record retention requirements. For matters involving SEC-regulated entities, consider extending to 10 years. Enable backup encryption with a key stored separately from the backup repository. Test restores quarterly and document results for compliance audits.
Step 13: End-to-End Integration Testing
Run a complete test transaction through the entire pipeline: upload documents to iManage, trigger the DD agent, verify vector indexing, review AI analysis, and validate the final risk report output.
curl -X POST https://dd-reports.firmname.com/api/v1/review \
-H 'Authorization: Bearer YOUR_JWT_TOKEN' \
-H 'Content-Type: application/json' \
-d '{
"matter_id": "TEST-DD-001",
"clio_matter_id": "12345",
"imanage_folder_id": "FOLDER_ID",
"review_type": "full_due_diligence",
"risk_categories": ["change_of_control", "ip_assignment", "indemnification", "termination", "non_compete", "data_privacy", "governing_law"],
"report_format": "pdf_and_json",
"notify_teams_channel": true
}'
curl https://dd-reports.firmname.com/api/v1/review/TEST-DD-001/status
curl https://dd-reports.firmname.com/api/v1/review/TEST-DD-001/report -o test-dd-report.pdf
Use real but anonymized documents for testing — synthetic documents may not trigger the same edge cases as actual legal documents. The test should include at least one document with: a change-of-control clause, a non-standard indemnification provision, a missing governing law clause, and a data privacy clause referencing GDPR. Verify that the Teams notification was received in the Deal Room channel. Check that a time entry was created in Clio for the AI-assisted review.
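The JSON half of the report output can be smoke-tested automatically before an attorney ever opens the PDF. A validation sketch: the field names mirror the RiskFinding dataclass defined under Custom AI Components, but `validate_report` itself is a hypothetical helper, not part of the deployed agent.

```python
REQUIRED_FINDING_KEYS = {
    'finding_id', 'category', 'severity', 'title', 'description',
    'affected_documents', 'relevant_clauses', 'recommendation',
    'confidence_score',
}
SEVERITIES = {'critical', 'high', 'medium', 'low', 'informational'}

def validate_report(report: dict) -> list:
    """Return a list of problems in the JSON risk report (empty list = pass)."""
    problems = []
    for i, finding in enumerate(report.get('findings', [])):
        missing = REQUIRED_FINDING_KEYS - finding.keys()
        if missing:
            problems.append(f'finding {i}: missing {sorted(missing)}')
        if finding.get('severity') not in SEVERITIES:
            problems.append(f'finding {i}: bad severity {finding.get("severity")!r}')
        # Critical findings must carry the human-in-the-loop flag.
        if finding.get('severity') == 'critical' and not finding.get('requires_human_review'):
            problems.append(f'finding {i}: critical finding not flagged for human review')
    return problems
```

Wiring this into the integration test turns "validate the final risk report output" into a repeatable pass/fail check rather than a manual read-through.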
Custom AI Components
Document Ingestion and Preprocessing Pipeline
Type: workflow
Automated pipeline that retrieves documents from iManage or SharePoint, performs OCR on scanned documents via Azure AI Document Intelligence, extracts clean text, chunks documents semantically, generates embeddings via text-embedding-3-large, and stores them in Pinecone with metadata including document type, matter ID, date, and source filename. This pipeline runs automatically when a new DD review is triggered.
Implementation
# document_ingestion.py
import os
import io
import hashlib
from typing import List, Dict, Any
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from openai import AzureOpenAI
from pinecone import Pinecone
import requests
import tiktoken
# Configuration
KEY_VAULT_URI = os.environ['KEY_VAULT_URI']
credential = DefaultAzureCredential()
kv_client = SecretClient(vault_url=KEY_VAULT_URI, credential=credential)
AZURE_OPENAI_ENDPOINT = os.environ['AZURE_OPENAI_ENDPOINT']
AZURE_OPENAI_KEY = kv_client.get_secret('AzureOpenAIKey').value
DOC_INTEL_ENDPOINT = os.environ['DOC_INTEL_ENDPOINT']
DOC_INTEL_KEY = kv_client.get_secret('DocIntelKey').value
PINECONE_API_KEY = kv_client.get_secret('PineconeApiKey').value
PINECONE_INDEX = os.environ['PINECONE_INDEX']
# Initialize clients
oai_client = AzureOpenAI(
    api_key=AZURE_OPENAI_KEY,
    api_version='2024-06-01',
    azure_endpoint=AZURE_OPENAI_ENDPOINT
)
doc_intel_client = DocumentAnalysisClient(
    endpoint=DOC_INTEL_ENDPOINT,
    credential=AzureKeyCredential(DOC_INTEL_KEY)
)
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(PINECONE_INDEX)
# 'gpt-5.4' is not in tiktoken's model registry; use a named encoding instead
encoding = tiktoken.get_encoding('o200k_base')
def fetch_documents_from_imanage(access_token: str, customer_id: str, folder_id: str) -> List[Dict]:
    """Retrieve all documents from an iManage folder."""
    headers = {'Authorization': f'Bearer {access_token}'}
    url = f'https://cloudimanage.com/work/api/v2/customers/{customer_id}/folders/{folder_id}/documents'
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    documents = response.json()['data']
    result = []
    for doc in documents:
        doc_url = f'https://cloudimanage.com/work/api/v2/customers/{customer_id}/documents/{doc["id"]}/download'
        doc_response = requests.get(doc_url, headers=headers)
        result.append({
            'id': doc['id'],
            'name': doc['name'],
            'extension': doc.get('extension', 'pdf'),
            'content': doc_response.content,
            'metadata': {
                'author': doc.get('author', 'Unknown'),
                'create_date': doc.get('create_date', ''),
                'document_class': doc.get('class', 'General')
            }
        })
    return result
def ocr_document(content: bytes, filename: str) -> str:
    """Extract text from a document using Azure AI Document Intelligence."""
    poller = doc_intel_client.begin_analyze_document(
        'prebuilt-layout',
        document=io.BytesIO(content)
    )
    result = poller.result()
    full_text = ''
    for page in result.pages:
        for line in page.lines:
            full_text += line.content + '\n'
    # Also extract tables
    for table in result.tables:
        full_text += '\n[TABLE]\n'
        for cell in sorted(table.cells, key=lambda c: (c.row_index, c.column_index)):
            full_text += f'Row {cell.row_index}, Col {cell.column_index}: {cell.content}\n'
        full_text += '[/TABLE]\n'
    return full_text
def semantic_chunk(text: str, max_tokens: int = 1500, overlap_tokens: int = 200) -> List[str]:
    """Split text into semantically meaningful chunks with overlap."""
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ''
    current_tokens = 0
    for para in paragraphs:
        para_tokens = len(encoding.encode(para))
        if current_tokens + para_tokens > max_tokens and current_chunk:
            chunks.append(current_chunk.strip())
            # Overlap: carry the last two paragraphs into the next chunk
            tail = current_chunk.split('\n\n')
            overlap_text = tail[-2:] if len(tail) > 2 else []
            current_chunk = '\n\n'.join(overlap_text) + '\n\n' + para
            current_tokens = len(encoding.encode(current_chunk))
        else:
            current_chunk += '\n\n' + para
            current_tokens += para_tokens
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks
def generate_embeddings(texts: List[str]) -> List[List[float]]:
    """Generate embeddings using text-embedding-3-large."""
    response = oai_client.embeddings.create(
        input=texts,
        model='text-embedding-3-large'
    )
    return [item.embedding for item in response.data]
def ingest_document_package(matter_id: str, imanage_token: str, customer_id: str, folder_id: str) -> Dict[str, Any]:
    """Full ingestion pipeline for a DD document package."""
    documents = fetch_documents_from_imanage(imanage_token, customer_id, folder_id)
    stats = {'total_documents': len(documents), 'total_chunks': 0, 'total_pages_ocr': 0}
    for doc in documents:
        # OCR and text extraction
        text = ocr_document(doc['content'], doc['name'])
        stats['total_pages_ocr'] += text.count('\n') // 50  # Rough page estimate
        # Chunk the document
        chunks = semantic_chunk(text)
        stats['total_chunks'] += len(chunks)
        # Generate embeddings in batches of 20
        for i in range(0, len(chunks), 20):
            batch = chunks[i:i+20]
            embeddings = generate_embeddings(batch)
            # Upsert to Pinecone
            vectors = []
            for j, (chunk, embedding) in enumerate(zip(batch, embeddings)):
                chunk_id = hashlib.sha256(f'{doc["id"]}_{i+j}'.encode()).hexdigest()[:32]
                vectors.append({
                    'id': chunk_id,
                    'values': embedding,
                    'metadata': {
                        'matter_id': matter_id,
                        'document_id': doc['id'],
                        'document_name': doc['name'],
                        'chunk_index': i + j,
                        'text': chunk[:8000],  # Stay under Pinecone's per-vector metadata limit
                        'author': doc['metadata']['author'],
                        'create_date': doc['metadata']['create_date'],
                        'document_class': doc['metadata']['document_class']
                    }
                })
            index.upsert(vectors=vectors, namespace=f'matter-{matter_id}')
    return stats
Due Diligence Risk Analysis Agent
Type: agent
The core autonomous agent that orchestrates the multi-step due diligence analysis. It implements a ReAct (Reasoning + Acting) loop that: (1) classifies each document by type, (2) extracts key clauses and provisions using targeted prompts, (3) performs cross-document consistency checks via vector search, (4) identifies risks and anomalies, (5) categorizes findings by severity, and (6) generates the structured risk report. The agent uses tool-calling to access vector search, document retrieval, and classification functions. It includes human-in-the-loop checkpoints where attorneys must approve critical risk assessments before the final report is generated.
Implementation
# dd_risk_agent.py
import json
import logging
from typing import List, Dict, Any, Optional
from dataclasses import dataclass, field, asdict
from enum import Enum
from openai import AzureOpenAI
from pinecone import Pinecone
import datetime
logger = logging.getLogger(__name__)
class RiskSeverity(str, Enum):
    CRITICAL = 'critical'
    HIGH = 'high'
    MEDIUM = 'medium'
    LOW = 'low'
    INFO = 'informational'

class ReviewStatus(str, Enum):
    PENDING = 'pending'
    IN_PROGRESS = 'in_progress'
    AWAITING_HUMAN_REVIEW = 'awaiting_human_review'
    COMPLETED = 'completed'
    FAILED = 'failed'
@dataclass
class RiskFinding:
    finding_id: str
    category: str
    severity: RiskSeverity
    title: str
    description: str
    affected_documents: List[str]
    relevant_clauses: List[str]
    recommendation: str
    confidence_score: float
    requires_human_review: bool = False
    human_approved: Optional[bool] = None

@dataclass
class DDReviewState:
    matter_id: str
    status: ReviewStatus = ReviewStatus.PENDING
    documents_classified: List[Dict] = field(default_factory=list)
    clause_extractions: List[Dict] = field(default_factory=list)
    cross_doc_findings: List[Dict] = field(default_factory=list)
    risk_findings: List[RiskFinding] = field(default_factory=list)
    human_checkpoints: List[Dict] = field(default_factory=list)
    processing_log: List[str] = field(default_factory=list)
RISK_CATEGORIES = [
    'change_of_control',
    'ip_assignment_and_ownership',
    'indemnification_and_liability',
    'termination_and_expiration',
    'non_compete_and_non_solicit',
    'data_privacy_and_security',
    'governing_law_and_jurisdiction',
    'consent_and_notice_requirements',
    'material_adverse_change',
    'representations_and_warranties',
    'payment_and_financial_terms',
    'insurance_requirements',
    'environmental_compliance',
    'employment_and_labor'
]
SYSTEM_PROMPT = """You are a senior legal due diligence analyst AI agent. Your role is to systematically review documents in a transaction and identify risks, anomalies, missing provisions, and inconsistencies.
You have access to the following tools:
1. classify_document - Classify a document by type and relevance
2. extract_clauses - Extract specific clause types from a document
3. search_related_clauses - Search across all documents for related or contradictory clauses
4. assess_risk - Evaluate a specific finding and assign risk severity
5. request_human_review - Flag a finding for attorney review (use for CRITICAL findings)
IMPORTANT RULES:
- Always cite specific document names and clause text in findings
- Never fabricate clause text - only reference text you have actually retrieved
- Assign confidence scores honestly (0.0-1.0)
- Flag ANY finding rated CRITICAL for human review
- Consider cross-border implications for international transactions
- Check for consistency of defined terms across all documents
- Identify missing standard provisions (e.g., no governing law clause)
- Note any unusual or non-market terms
Risk severity guidelines:
- CRITICAL: Deal-breaker issues requiring immediate attorney attention (e.g., undisclosed litigation, missing IP assignments, change of control triggers that would terminate key contracts)
- HIGH: Significant issues that may affect deal terms or valuation (e.g., broad indemnification obligations, restrictive non-competes, unfavorable termination provisions)
- MEDIUM: Notable issues requiring negotiation or disclosure (e.g., non-standard payment terms, ambiguous definitions, consent requirements for assignment)
- LOW: Minor issues or standard deviations from market norms (e.g., slightly non-standard notice periods, minor formatting inconsistencies)
- INFORMATIONAL: Observations for awareness (e.g., contracts nearing expiration, standard renewal provisions)"""
TOOLS = [
{
'type': 'function',
'function': {
'name': 'classify_document',
'description': 'Classify a document by type (contract, corporate filing, financial statement, IP record, etc.) and assess its relevance to the transaction.',
'parameters': {
'type': 'object',
'properties': {
'document_name': {'type': 'string'},
'document_text_preview': {'type': 'string', 'description': 'First 2000 characters of the document'},
'classification': {'type': 'string', 'enum': ['contract', 'amendment', 'corporate_filing', 'financial_statement', 'ip_record', 'real_estate', 'employment_agreement', 'regulatory_filing', 'litigation_record', 'insurance_policy', 'other']},
'relevance_score': {'type': 'number', 'minimum': 0, 'maximum': 1},
'key_parties': {'type': 'array', 'items': {'type': 'string'}},
'effective_date': {'type': 'string'},
'expiration_date': {'type': 'string'}
},
'required': ['document_name', 'classification', 'relevance_score']
}
}
},
{
'type': 'function',
'function': {
'name': 'extract_clauses',
'description': 'Extract specific clause types from a document chunk. Call this for each risk category you need to analyze.',
'parameters': {
'type': 'object',
'properties': {
'document_name': {'type': 'string'},
'clause_category': {'type': 'string', 'enum': RISK_CATEGORIES},
'extracted_text': {'type': 'string', 'description': 'The exact clause text found'},
'clause_summary': {'type': 'string'},
'is_standard': {'type': 'boolean', 'description': 'Whether this is a market-standard provision'},
'notable_deviations': {'type': 'string'}
},
'required': ['document_name', 'clause_category', 'extracted_text', 'clause_summary', 'is_standard']
}
}
},
{
'type': 'function',
'function': {
'name': 'search_related_clauses',
'description': 'Search the full document corpus for clauses related to a specific topic or that might contradict a given clause.',
'parameters': {
'type': 'object',
'properties': {
'query': {'type': 'string', 'description': 'Semantic search query for related clauses'},
'category_filter': {'type': 'string'},
'top_k': {'type': 'integer', 'default': 10}
},
'required': ['query']
}
}
},
{
'type': 'function',
'function': {
'name': 'assess_risk',
'description': 'Record a risk finding with severity assessment and recommendation.',
'parameters': {
'type': 'object',
'properties': {
'category': {'type': 'string', 'enum': RISK_CATEGORIES},
'severity': {'type': 'string', 'enum': ['critical', 'high', 'medium', 'low', 'informational']},
'title': {'type': 'string'},
'description': {'type': 'string'},
'affected_documents': {'type': 'array', 'items': {'type': 'string'}},
'relevant_clause_text': {'type': 'array', 'items': {'type': 'string'}},
'recommendation': {'type': 'string'},
'confidence_score': {'type': 'number', 'minimum': 0, 'maximum': 1}
},
'required': ['category', 'severity', 'title', 'description', 'affected_documents', 'recommendation', 'confidence_score']
}
}
},
{
'type': 'function',
'function': {
'name': 'request_human_review',
'description': 'Flag a critical finding for mandatory attorney review before including in the final report.',
'parameters': {
'type': 'object',
'properties': {
'finding_title': {'type': 'string'},
'reason': {'type': 'string'},
'urgency': {'type': 'string', 'enum': ['immediate', 'before_report', 'informational']}
},
'required': ['finding_title', 'reason', 'urgency']
}
}
}
]
class DDRiskAgent:
def __init__(self, oai_client: AzureOpenAI, pinecone_index, deployment_name: str = 'gpt-5.4-dd', embedding_deployment: str = 'text-embedding-3-large'):
self.oai_client = oai_client
self.index = pinecone_index
self.deployment = deployment_name
self.embedding_deployment = embedding_deployment
def _vector_search(self, query: str, matter_id: str, top_k: int = 10, category_filter: str = None) -> List[Dict]:
"""Search the document corpus for relevant chunks."""
embedding = self.oai_client.embeddings.create(
input=[query], model=self.embedding_deployment
).data[0].embedding
filter_dict = {'matter_id': matter_id}
if category_filter:
filter_dict['document_class'] = category_filter
results = self.index.query(
vector=embedding,
top_k=top_k,
include_metadata=True,
namespace=f'matter-{matter_id}',
filter=filter_dict
)
return [{'score': m.score, 'text': m.metadata.get('text', ''), 'document_name': m.metadata.get('document_name', ''), 'chunk_index': m.metadata.get('chunk_index', 0)} for m in results.matches]
def _execute_tool(self, tool_name: str, arguments: Dict, state: DDReviewState) -> str:
"""Execute a tool call and update state."""
if tool_name == 'classify_document':
state.documents_classified.append(arguments)
return json.dumps({'status': 'classified', 'document': arguments.get('document_name'), 'type': arguments.get('classification')})
elif tool_name == 'extract_clauses':
state.clause_extractions.append(arguments)
return json.dumps({'status': 'extracted', 'document': arguments.get('document_name'), 'category': arguments.get('clause_category')})
elif tool_name == 'search_related_clauses':
results = self._vector_search(
query=arguments.get('query', ''),
matter_id=state.matter_id,
top_k=arguments.get('top_k', 10),
category_filter=arguments.get('category_filter')
)
state.cross_doc_findings.append({'query': arguments.get('query'), 'results_count': len(results)})
return json.dumps({'results': results})
elif tool_name == 'assess_risk':
finding = RiskFinding(
finding_id=f'RF-{len(state.risk_findings)+1:04d}',
category=arguments.get('category', 'other'),
severity=RiskSeverity(arguments.get('severity', 'medium')),
title=arguments.get('title', ''),
description=arguments.get('description', ''),
affected_documents=arguments.get('affected_documents', []),
relevant_clauses=arguments.get('relevant_clause_text', []),
recommendation=arguments.get('recommendation', ''),
confidence_score=arguments.get('confidence_score', 0.5),
requires_human_review=(arguments.get('severity') == 'critical')
)
state.risk_findings.append(finding)
return json.dumps({'status': 'recorded', 'finding_id': finding.finding_id, 'severity': finding.severity.value})
elif tool_name == 'request_human_review':
checkpoint = {
'finding_title': arguments.get('finding_title'),
'reason': arguments.get('reason'),
'urgency': arguments.get('urgency'),
'timestamp': datetime.datetime.utcnow().isoformat(),
'resolved': False
}
state.human_checkpoints.append(checkpoint)
return json.dumps({'status': 'flagged_for_review', 'checkpoint': checkpoint})
return json.dumps({'error': f'Unknown tool: {tool_name}'})
def run_analysis(self, matter_id: str, document_summaries: List[Dict], risk_categories: List[str] = None) -> DDReviewState:
"""Run the full DD analysis agent loop."""
state = DDReviewState(matter_id=matter_id, status=ReviewStatus.IN_PROGRESS)
categories = risk_categories or RISK_CATEGORIES
# Build initial context with document summaries
doc_context = 'DOCUMENTS IN THIS DUE DILIGENCE PACKAGE:\n'
for doc in document_summaries:
doc_context += f"- {doc['name']} (first 2000 chars): {doc['preview'][:2000]}\n\n"
user_prompt = f"""Conduct a comprehensive due diligence review of the following document package for matter {matter_id}.
{doc_context}
Analyze these risk categories: {', '.join(categories)}
Procedure:
1. First, classify each document by type and relevance
2. For each relevant document, extract key clauses in each risk category
3. Use search_related_clauses to find cross-document inconsistencies and contradictions
4. Assess each finding with appropriate severity
5. Flag any CRITICAL findings for human review
6. Ensure you check for: missing standard provisions, non-standard terms, cross-document term definition consistency, and assignment/change-of-control implications
Be thorough and systematic. Process every document."""
messages = [
{'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': user_prompt}
]
max_iterations = 50 # Safety limit
iteration = 0
while iteration < max_iterations:
iteration += 1
state.processing_log.append(f'Iteration {iteration}')
response = self.oai_client.chat.completions.create(
model=self.deployment,
messages=messages,
tools=TOOLS,
tool_choice='auto',
temperature=0.1,
max_tokens=4096
)
choice = response.choices[0]
messages.append(choice.message)
if choice.finish_reason == 'stop':
state.processing_log.append('Agent completed analysis')
break
if choice.finish_reason == 'tool_calls':
for tool_call in choice.message.tool_calls:
fn_name = tool_call.function.name
fn_args = json.loads(tool_call.function.arguments)
state.processing_log.append(f'Tool call: {fn_name}')
result = self._execute_tool(fn_name, fn_args, state)
messages.append({
'role': 'tool',
'tool_call_id': tool_call.id,
'content': result
})
# Determine final status
if any(cp for cp in state.human_checkpoints if not cp['resolved']):
state.status = ReviewStatus.AWAITING_HUMAN_REVIEW
else:
state.status = ReviewStatus.COMPLETED
return state
Risk Report Generator
Type: workflow. Generates structured PDF and JSON risk reports from the DDReviewState produced by the analysis agent. The report includes an executive summary, a document inventory, findings organized by severity and category, a cross-reference matrix, and recommended next steps. It outputs both a formatted PDF for attorney review and structured JSON for programmatic consumption and integration with Clio matter notes.
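Both output paths order findings from most to least severe before rendering. A standalone sketch of that ordering (the `sort_findings` helper name is illustrative, not part of the generator; the severity ranks match the implementation below):

```python
# Severity ranking used to order findings in the report (most severe first).
SEVERITY_ORDER = {'critical': 0, 'high': 1, 'medium': 2, 'low': 3, 'informational': 4}

def sort_findings(findings):
    """Return findings sorted by severity; unknown severities sink to the end."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER.get(f.get('severity'), 5))

findings = [
    {'id': 'RF-0002', 'severity': 'low'},
    {'id': 'RF-0001', 'severity': 'critical'},
    {'id': 'RF-0003', 'severity': 'high'},
]
ordered = sort_findings(findings)  # RF-0001, RF-0003, RF-0002
```

Using `.get(..., 5)` rather than direct indexing means a malformed severity value degrades to "sorted last" instead of crashing report generation.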
Implementation
# report_generator.py
import json
import datetime
from typing import List, Dict, Any
from dataclasses import asdict
from openai import AzureOpenAI
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.colors import HexColor
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle, PageBreak
from reportlab.lib.units import inch
from io import BytesIO
class DDReportGenerator:
def __init__(self, oai_client: AzureOpenAI, deployment: str = 'gpt-5.4-dd'):
self.oai_client = oai_client
self.deployment = deployment
def _generate_executive_summary(self, state) -> str:
"""Use LLM to generate a natural-language executive summary."""
findings_summary = json.dumps([{
'id': f.finding_id, 'severity': f.severity.value,
'title': f.title, 'category': f.category
} for f in state.risk_findings], indent=2)
prompt = f"""Based on the following due diligence findings, write a concise executive summary (3-4 paragraphs) suitable for a senior partner reviewing this DD report.
Matter ID: {state.matter_id}
Total documents reviewed: {len(state.documents_classified)}
Total findings: {len(state.risk_findings)}
Critical findings: {sum(1 for f in state.risk_findings if f.severity.value == 'critical')}
High findings: {sum(1 for f in state.risk_findings if f.severity.value == 'high')}
Findings overview:
{findings_summary}
Write in professional legal tone. Highlight deal-breaker issues first. Note areas that require further investigation. Do NOT provide legal advice — present findings for attorney evaluation."""
response = self.oai_client.chat.completions.create(
model=self.deployment,
messages=[{'role': 'user', 'content': prompt}],
temperature=0.3, max_tokens=2000
)
return response.choices[0].message.content
def generate_json_report(self, state) -> Dict[str, Any]:
"""Generate structured JSON report."""
exec_summary = self._generate_executive_summary(state)
severity_order = {'critical': 0, 'high': 1, 'medium': 2, 'low': 3, 'informational': 4}
sorted_findings = sorted(state.risk_findings, key=lambda f: severity_order.get(f.severity.value, 5))
report = {
'report_metadata': {
'matter_id': state.matter_id,
'generated_at': datetime.datetime.utcnow().isoformat(),
'report_version': '1.0',
'agent_version': 'dd-agent-v1.0',
'status': state.status.value,
'disclaimer': 'This report was generated by an AI due diligence agent and must be reviewed by a licensed attorney before reliance. AI-generated findings may contain errors or omissions. This report does not constitute legal advice.'
},
'executive_summary': exec_summary,
'statistics': {
'total_documents': len(state.documents_classified),
'total_findings': len(state.risk_findings),
'by_severity': {
'critical': sum(1 for f in state.risk_findings if f.severity.value == 'critical'),
'high': sum(1 for f in state.risk_findings if f.severity.value == 'high'),
'medium': sum(1 for f in state.risk_findings if f.severity.value == 'medium'),
'low': sum(1 for f in state.risk_findings if f.severity.value == 'low'),
'informational': sum(1 for f in state.risk_findings if f.severity.value == 'informational')
},
'human_review_required': len([cp for cp in state.human_checkpoints if not cp['resolved']])
},
'document_inventory': state.documents_classified,
'findings': [asdict(f) for f in sorted_findings],
'human_review_checkpoints': state.human_checkpoints,
'processing_log_summary': {
'total_iterations': len(state.processing_log),
'tool_calls': len([l for l in state.processing_log if 'Tool call' in l])
}
}
return report
def generate_pdf_report(self, state, output_path: str = None) -> bytes:
"""Generate formatted PDF report."""
buffer = BytesIO()
doc = SimpleDocTemplate(buffer, pagesize=letter, topMargin=0.75*inch, bottomMargin=0.75*inch)
styles = getSampleStyleSheet()
# Custom styles
title_style = ParagraphStyle('DDTitle', parent=styles['Title'], fontSize=18, textColor=HexColor('#1a237e'))
heading_style = ParagraphStyle('DDHeading', parent=styles['Heading2'], textColor=HexColor('#1a237e'), spaceAfter=12)
severity_colors = {'critical': '#d32f2f', 'high': '#f57c00', 'medium': '#fbc02d', 'low': '#388e3c', 'informational': '#1976d2'}
story = []
# Title page
story.append(Paragraph('DUE DILIGENCE RISK REPORT', title_style))
story.append(Spacer(1, 12))
story.append(Paragraph(f'Matter: {state.matter_id}', styles['Normal']))
story.append(Paragraph(f'Generated: {datetime.datetime.utcnow().strftime("%B %d, %Y at %H:%M UTC")}', styles['Normal']))
story.append(Paragraph('CONFIDENTIAL — ATTORNEY WORK PRODUCT', ParagraphStyle('Confidential', parent=styles['Normal'], textColor=HexColor('#d32f2f'), fontSize=12, spaceAfter=24)))
story.append(Paragraph('DISCLAIMER: This report was generated by an AI system and must be reviewed and validated by a licensed attorney before any reliance. Findings may contain errors or omissions. This does not constitute legal advice.', ParagraphStyle('Disclaimer', parent=styles['Normal'], backColor=HexColor('#fff3e0'), fontSize=9, spaceAfter=24)))
# Executive Summary
exec_summary = self._generate_executive_summary(state)
story.append(Paragraph('EXECUTIVE SUMMARY', heading_style))
for para in exec_summary.split('\n\n'):
story.append(Paragraph(para, styles['Normal']))
story.append(Spacer(1, 6))
# Statistics table
story.append(PageBreak())
story.append(Paragraph('FINDINGS SUMMARY', heading_style))
stats_data = [['Severity', 'Count']]
for sev in ['critical', 'high', 'medium', 'low', 'informational']:
count = sum(1 for f in state.risk_findings if f.severity.value == sev)
stats_data.append([sev.upper(), str(count)])
stats_table = Table(stats_data, colWidths=[3*inch, 2*inch])
stats_table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), HexColor('#1a237e')),
('TEXTCOLOR', (0, 0), (-1, 0), HexColor('#ffffff')),
('GRID', (0, 0), (-1, -1), 0.5, HexColor('#cccccc')),
('FONTSIZE', (0, 0), (-1, -1), 10),
('PADDING', (0, 0), (-1, -1), 8)
]))
story.append(stats_table)
story.append(Spacer(1, 24))
# Detailed findings
story.append(Paragraph('DETAILED FINDINGS', heading_style))
severity_order = {'critical': 0, 'high': 1, 'medium': 2, 'low': 3, 'informational': 4}
for finding in sorted(state.risk_findings, key=lambda f: severity_order.get(f.severity.value, 5)):
color = severity_colors.get(finding.severity.value, '#000000')
story.append(Paragraph(f'<font color="{color}"><b>[{finding.severity.value.upper()}]</b></font> {finding.finding_id}: {finding.title}', styles['Heading3']))
story.append(Paragraph(f'<b>Category:</b> {finding.category.replace("_", " ").title()}', styles['Normal']))
story.append(Paragraph(f'<b>Description:</b> {finding.description}', styles['Normal']))
story.append(Paragraph(f'<b>Affected Documents:</b> {", ".join(finding.affected_documents)}', styles['Normal']))
story.append(Paragraph(f'<b>Recommendation:</b> {finding.recommendation}', styles['Normal']))
story.append(Paragraph(f'<b>Confidence:</b> {finding.confidence_score:.0%}', styles['Normal']))
if finding.requires_human_review:
story.append(Paragraph('<font color="#d32f2f"><b>⚠ REQUIRES ATTORNEY REVIEW BEFORE FINALIZATION</b></font>', styles['Normal']))
story.append(Spacer(1, 16))
doc.build(story)
pdf_bytes = buffer.getvalue()
if output_path:
with open(output_path, 'wb') as f:
f.write(pdf_bytes)
return pdf_bytes
Human-in-the-Loop Review Portal
Type: integration. A lightweight Flask web application that serves as the attorney review interface. Attorneys receive Teams or email notifications when the DD agent flags critical findings for human review. From this portal they review flagged findings, approve or modify risk assessments, add comments, and authorize final report generation. This workflow supports the firm's supervisory obligations under ABA Model Rule 5.3.
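The portal's key safeguard is the finalize gate: no report is generated while any flagged finding is neither approved nor rejected. That check can be sketched as a pure function (the name `unreviewed_findings` is illustrative; the field names match the implementation below):

```python
def unreviewed_findings(findings):
    """IDs of findings still blocking finalization: flagged for human
    review but neither approved nor rejected by an attorney yet."""
    return [
        f['finding_id'] for f in findings
        if f.get('requires_human_review')
        and not f.get('human_approved')
        and not f.get('rejected')
    ]

findings = [
    {'finding_id': 'RF-0001', 'requires_human_review': True, 'human_approved': True},
    {'finding_id': 'RF-0002', 'requires_human_review': True},
    {'finding_id': 'RF-0003', 'requires_human_review': False},
]
blocking = unreviewed_findings(findings)  # only RF-0002 blocks finalization
```

Keeping this as a pure function over the findings list makes the gate easy to unit-test independently of Flask routing.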
Implementation
# review_portal.py
from flask import Flask, request, jsonify, render_template_string
from functools import wraps
import jwt
import datetime
import json
from typing import Dict
import requests
app = Flask(__name__)
app.config['SECRET_KEY'] = 'loaded-from-key-vault-at-startup'
# In-memory store (replace with Azure Cosmos DB or PostgreSQL in production)
review_queue: Dict[str, dict] = {}
REVIEW_PAGE_TEMPLATE = """
<!DOCTYPE html>
<html>
<head><title>DD Review Portal - {{ matter_id }}</title>
<style>
body { font-family: 'Segoe UI', sans-serif; max-width: 900px; margin: 0 auto; padding: 20px; }
.finding { border: 1px solid #ddd; border-radius: 8px; padding: 16px; margin: 12px 0; }
.critical { border-left: 4px solid #d32f2f; }
.high { border-left: 4px solid #f57c00; }
.badge { display: inline-block; padding: 2px 8px; border-radius: 4px; color: white; font-size: 12px; }
.badge-critical { background: #d32f2f; }
.badge-high { background: #f57c00; }
.btn { padding: 8px 16px; border: none; border-radius: 4px; cursor: pointer; margin: 4px; }
.btn-approve { background: #388e3c; color: white; }
.btn-modify { background: #1976d2; color: white; }
.btn-reject { background: #d32f2f; color: white; }
textarea { width: 100%; min-height: 60px; margin: 8px 0; }
.disclaimer { background: #fff3e0; padding: 12px; border-radius: 4px; margin-bottom: 20px; }
</style></head>
<body>
<h1>Due Diligence Review Portal</h1>
<p>Matter: <strong>{{ matter_id }}</strong></p>
<div class='disclaimer'>Items below have been flagged by the AI agent as requiring attorney review per ABA Model Rule 5.3. Please review each finding, verify accuracy against source documents, and approve, modify, or reject.</div>
{% for finding in findings %}
<div class='finding {{ finding.severity }}'>
<span class='badge badge-{{ finding.severity }}'>{{ finding.severity | upper }}</span>
<h3>{{ finding.finding_id }}: {{ finding.title }}</h3>
<p><strong>Category:</strong> {{ finding.category }}</p>
<p><strong>Description:</strong> {{ finding.description }}</p>
<p><strong>Affected Documents:</strong> {{ finding.affected_documents | join(', ') }}</p>
<p><strong>AI Recommendation:</strong> {{ finding.recommendation }}</p>
<p><strong>AI Confidence:</strong> {{ (finding.confidence_score * 100) | int }}%</p>
<form method='POST' action='/api/v1/review/{{ matter_id }}/findings/{{ finding.finding_id }}'>
<textarea name='attorney_notes' placeholder='Attorney notes (required for modifications)'></textarea>
<select name='modified_severity'>
<option value=''>Keep current severity</option>
<option value='critical'>Critical</option>
<option value='high'>High</option>
<option value='medium'>Medium</option>
<option value='low'>Low</option>
<option value='informational'>Informational</option>
</select>
<br>
<button type='submit' name='action' value='approve' class='btn btn-approve'>✓ Approve Finding</button>
<button type='submit' name='action' value='modify' class='btn btn-modify'>✎ Modify & Approve</button>
<button type='submit' name='action' value='reject' class='btn btn-reject'>✗ Reject Finding</button>
</form>
</div>
{% endfor %}
<form method='POST' action='/api/v1/review/{{ matter_id }}/finalize'>
<button type='submit' class='btn btn-approve' style='font-size:16px; padding:12px 24px; margin-top:24px;'>Generate Final Report</button>
</form>
</body></html>
"""
def require_auth(f):
@wraps(f)
def decorated(*args, **kwargs):
token = request.headers.get('Authorization', '').replace('Bearer ', '')
if not token:
token = request.cookies.get('auth_token', '')
try:
payload = jwt.decode(token, app.config['SECRET_KEY'], algorithms=['HS256'])
request.user = payload
except jwt.InvalidTokenError:
return jsonify({'error': 'Authentication required'}), 401
return f(*args, **kwargs)
return decorated
@app.route('/api/v1/review/<matter_id>', methods=['GET'])
@require_auth
def get_review_page(matter_id):
if matter_id not in review_queue:
return jsonify({'error': 'Matter not found'}), 404
state = review_queue[matter_id]
flagged = [f for f in state['findings'] if f.get('requires_human_review', False) and not f.get('human_approved')]
return render_template_string(REVIEW_PAGE_TEMPLATE, matter_id=matter_id, findings=flagged)
@app.route('/api/v1/review/<matter_id>/findings/<finding_id>', methods=['POST'])
@require_auth
def review_finding(matter_id, finding_id):
action = request.form.get('action')
notes = request.form.get('attorney_notes', '')
modified_severity = request.form.get('modified_severity', '')
state = review_queue.get(matter_id)
if not state:
return jsonify({'error': 'Matter not found'}), 404
for finding in state['findings']:
if finding['finding_id'] == finding_id:
finding['human_approved'] = (action in ['approve', 'modify'])
finding['attorney_notes'] = notes
finding['reviewed_by'] = request.user.get('email', 'unknown')
finding['reviewed_at'] = datetime.datetime.utcnow().isoformat()
if action == 'modify' and modified_severity:
finding['severity'] = modified_severity
if action == 'reject':
finding['rejected'] = True
break
return jsonify({'status': 'updated', 'finding_id': finding_id, 'action': action})
@app.route('/api/v1/review/<matter_id>/finalize', methods=['POST'])
@require_auth
def finalize_report(matter_id):
state = review_queue.get(matter_id)
if not state:
return jsonify({'error': 'Matter not found'}), 404
unreviewed = [f for f in state['findings'] if f.get('requires_human_review') and not f.get('human_approved') and not f.get('rejected')]
if unreviewed:
return jsonify({'error': f'{len(unreviewed)} findings still require review', 'unreviewed_ids': [f['finding_id'] for f in unreviewed]}), 400
# Trigger report generation (calls report_generator)
state['finalized'] = True
state['finalized_by'] = request.user.get('email')
state['finalized_at'] = datetime.datetime.utcnow().isoformat()
return jsonify({'status': 'finalized', 'message': 'Report generation initiated'})
def send_teams_notification(webhook_url: str, matter_id: str, critical_count: int, total_count: int):
"""Send notification to Teams deal room when review is needed."""
card = {
'@type': 'MessageCard',
'themeColor': 'd32f2f' if critical_count > 0 else 'f57c00',
'summary': f'DD Review Required: {matter_id}',
'sections': [{
'activityTitle': f'Due Diligence Review Required: {matter_id}',
'facts': [
{'name': 'Total Findings', 'value': str(total_count)},
{'name': 'Critical (Needs Review)', 'value': str(critical_count)},
{'name': 'Generated', 'value': datetime.datetime.utcnow().strftime('%Y-%m-%d %H:%M UTC')}
],
'markdown': True
}],
'potentialAction': [{
'@type': 'OpenUri',
'name': 'Open Review Portal',
'targets': [{'os': 'default', 'uri': f'https://dd-reports.firmname.com/api/v1/review/{matter_id}'}]
}]
}
requests.post(webhook_url, json=card, timeout=10)
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
Clio Time Entry and Matter Sync Integration
Type: integration. Automatically logs time entries in Clio for AI-assisted due diligence work, syncs matter metadata for agent context, and posts DD report summaries as matter notes. This ensures proper billing attribution and matter documentation.
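A standalone sketch of the time-entry payload the integration builds (field names mirror `log_time_entry` in the implementation below; treating `quantity` as seconds is an assumption to verify against Clio's Activities API docs):

```python
import datetime

def build_time_entry(matter_id, user_id, duration_seconds, description):
    """Build a Clio activity payload for an AI-assisted DD time entry.
    Mirrors log_time_entry below; 'quantity' assumed to be in seconds."""
    return {
        'data': {
            'date': datetime.date.today().isoformat(),
            'quantity': duration_seconds,
            'type': 'TimeEntry',
            'note': f'[AI-Assisted] {description}',
            'matter': {'id': int(matter_id)},
            'user': {'id': int(user_id)},
        }
    }

entry = build_time_entry('12345', '67', 1800, 'DD review of 42 documents')
```

Prefixing the note with `[AI-Assisted]` keeps AI-supported work distinguishable on invoices, which matters for billing transparency with clients.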
Implementation
# Clio API integration for time entry logging, matter note posting, and DD
# completion sync
# clio_integration.py
import requests
import datetime
from typing import Dict, Optional
class ClioIntegration:
BASE_URL = 'https://app.clio.com/api/v4'
def __init__(self, access_token: str, refresh_token: str, client_id: str, client_secret: str):
self.access_token = access_token
self.refresh_token = refresh_token
self.client_id = client_id
self.client_secret = client_secret
def _headers(self) -> Dict:
return {'Authorization': f'Bearer {self.access_token}', 'Content-Type': 'application/json'}
def _refresh_token_if_needed(self):
"""Refresh the OAuth access token. Clio's token endpoint sits at the
site root, not under /api/v4; call this on a 401 or before expiry."""
response = requests.post('https://app.clio.com/oauth/token', data={
'grant_type': 'refresh_token',
'refresh_token': self.refresh_token,
'client_id': self.client_id,
'client_secret': self.client_secret
})
if response.ok:
data = response.json()
self.access_token = data['access_token']
self.refresh_token = data['refresh_token']
def get_matter(self, matter_id: str) -> Dict:
"""Retrieve matter details for agent context."""
response = requests.get(
f'{self.BASE_URL}/matters/{matter_id}',
headers=self._headers(),
params={'fields': 'id,display_number,description,client,practice_area,status,custom_field_values'}
)
response.raise_for_status()
return response.json()['data']
def log_time_entry(self, matter_id: str, user_id: str, duration_seconds: int, description: str, activity_description_id: Optional[str] = None) -> Dict:
"""Log a time entry for AI-assisted DD work."""
entry = {
'data': {
'date': datetime.date.today().isoformat(),
'quantity': duration_seconds,
'quantity_in_hours': round(duration_seconds / 3600, 2),
'type': 'TimeEntry',
'note': f'[AI-Assisted] {description}',
'matter': {'id': int(matter_id)},
'user': {'id': int(user_id)}
}
}
if activity_description_id:
entry['data']['activity_description'] = {'id': int(activity_description_id)}
response = requests.post(f'{self.BASE_URL}/activities', headers=self._headers(), json=entry)
response.raise_for_status()
return response.json()['data']
def post_matter_note(self, matter_id: str, subject: str, detail: str) -> Dict:
"""Post a note to the matter with DD report summary."""
note = {
'data': {
'subject': subject,
'detail': detail,
'type': 'Note',
'matter': {'id': int(matter_id)},
'date': datetime.date.today().isoformat()
}
}
response = requests.post(f'{self.BASE_URL}/notes', headers=self._headers(), json=note)
response.raise_for_status()
return response.json()['data']
def sync_dd_completion(self, matter_id: str, report_summary: Dict, processing_time_seconds: int, reviewing_attorney_user_id: str):
"""Complete sync: log time, post note, update matter."""
# Log AI processing time
self.log_time_entry(
matter_id=matter_id,
user_id=reviewing_attorney_user_id,
duration_seconds=processing_time_seconds,
description=f'AI due diligence review - {report_summary["statistics"]["total_documents"]} documents analyzed, {report_summary["statistics"]["total_findings"]} findings identified'
)
# Post summary note
findings_text = f"""Due Diligence AI Review Completed
Documents Reviewed: {report_summary['statistics']['total_documents']}
Total Findings: {report_summary['statistics']['total_findings']}
Critical: {report_summary['statistics']['by_severity']['critical']}
High: {report_summary['statistics']['by_severity']['high']}
Medium: {report_summary['statistics']['by_severity']['medium']}
Low: {report_summary['statistics']['by_severity']['low']}
Executive Summary:
{report_summary['executive_summary'][:2000]}
Full report saved to iManage matter folder."""
self.post_matter_note(
matter_id=matter_id,
subject=f'DD Risk Report - {report_summary["statistics"]["total_findings"]} Findings',
detail=findings_text
)
DD Agent Orchestrator Prompt Library
Type: prompt. A versioned library of specialized prompts used by the DD agent at different analysis stages. Each prompt is optimized for legal due diligence accuracy and includes grounding instructions that reduce the risk of hallucination. Prompts are stored as configuration files and can be updated without redeploying the agent application.
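A minimal sketch of how the agent might look up a prompt configuration at runtime (the loader shape and `get_prompt` helper are assumptions, not part of the guide; in production the dict would come from `yaml.safe_load()` on the file below — an inline dict is used here so the sketch runs without PyYAML):

```python
# Abbreviated stand-in for prompts/dd_prompts.yaml; keys mirror the file.
PROMPT_LIBRARY = {
    'document_classification': {
        'version': '1.0',
        'system': 'You are a legal document classifier specializing in M&A due diligence. ...',
        'temperature': 0.1,
        'max_tokens': 1000,
    },
    'clause_extraction': {
        'version': '1.0',
        'system': 'You are a senior contract analyst extracting specific clause types. ...',
        'temperature': 0.0,
        'max_tokens': 2000,
    },
}

def get_prompt(stage):
    """Fetch a prompt config by stage name; fail loudly on unknown stages
    so a typo in a stage name cannot silently fall back to a default."""
    try:
        return PROMPT_LIBRARY[stage]
    except KeyError:
        raise ValueError(f'No prompt configured for stage: {stage}')

cfg = get_prompt('clause_extraction')
```

Because each entry carries its own `version`, `temperature`, and `max_tokens`, a prompt can be tuned and re-versioned without touching agent code.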
Implementation
# Versioned prompt library for DD agent analysis stages
# prompts/dd_prompts.yaml
# Version: 1.0.0
# Last Updated: 2025-07-30
# Reviewed By: [MSP Legal AI Lead]
prompt_library:
document_classification:
version: '1.0'
system: |
You are a legal document classifier specializing in M&A due diligence.
Classify the document into one of these categories:
- contract: Any binding agreement between parties
- amendment: Modification to an existing contract
- corporate_filing: Articles of incorporation, bylaws, board resolutions, annual reports
- financial_statement: Balance sheets, income statements, audited financials
- ip_record: Patents, trademarks, copyrights, IP assignments, license agreements
- real_estate: Leases, deeds, environmental reports
- employment_agreement: Offer letters, employment contracts, severance agreements
- regulatory_filing: Government filings, permits, licenses
- litigation_record: Court filings, settlement agreements, demand letters
- insurance_policy: Insurance certificates, policy documents
- other: Documents that don't fit above categories
Also extract: key parties, effective date, expiration date, governing law.
If you cannot determine a field with confidence, respond with 'UNDETERMINED' rather than guessing.
temperature: 0.1
max_tokens: 1000
clause_extraction:
version: '1.0'
system: |
You are a senior contract analyst extracting specific clause types from legal documents.
CRITICAL RULES:
1. Only extract text that is ACTUALLY PRESENT in the provided document chunk
2. Never generate, infer, or fabricate clause text
3. If a clause type is not present in this chunk, respond with 'NOT FOUND IN THIS SECTION'
4. Quote the exact language - do not paraphrase
5. Note any defined terms that are referenced but defined elsewhere
6. Identify if the clause is mutual or one-sided
7. Flag any carve-outs or exceptions within the clause
For each extracted clause, assess:
- Is this market-standard language? (Yes/No/Partially)
- What specific deviations from market standard exist?
- Are there any ambiguities in the language?
- Does this clause reference or depend on other clauses/documents?
temperature: 0.0
max_tokens: 2000
cross_document_analysis:
version: '1.0'
system: |
You are analyzing clauses across multiple documents in a due diligence package to identify:
1. CONTRADICTIONS: Clauses in different documents that conflict with each other
2. INCONSISTENCIES: Different defined terms or standards across documents
3. GAPS: Standard provisions present in some documents but missing from others
4. DEPENDENCIES: Clauses that reference or are contingent on provisions in other documents
5. CUMULATIVE RISK: Individual clauses that are acceptable alone but create risk when combined
For each finding, cite the EXACT document names and clause text.
Rate the significance: Critical / High / Medium / Low / Informational
Explain the practical business impact of each finding.
temperature: 0.1
max_tokens: 3000
change_of_control_analysis:
version: '1.0'
system: |
You are analyzing change-of-control provisions across all contracts in a due diligence package.
For each contract, determine:
1. Is there a change-of-control clause? (Quote exact text)
2. What triggers the clause? (Merger, acquisition, asset sale, board change, >50% ownership change, other)
3. What are the consequences? (Termination right, consent required, acceleration, price adjustment, other)
4. Is consent required for assignment? (Quote exact text)
5. Is there an anti-assignment clause that could prevent the transaction?
6. Are there any exceptions or carve-outs for affiliated entities?
7. What is the notice period required?
8. Would the contemplated transaction trigger this clause?
Assess aggregate risk: How many key contracts would be affected by the transaction?
Identify any contracts that could be deal-breakers if consent is not obtained.
temperature: 0.0
max_tokens: 3000
ip_ownership_analysis:
version: '1.0'
system: |
You are analyzing intellectual property ownership and assignment provisions.
For each document, determine:
1. Are there IP assignment clauses? (work-for-hire, assignment of inventions)
2. Are assignments present-tense ('hereby assigns') or future ('agrees to assign')?
3. Are there any retained rights or licenses back to the assignor?
4. Do employment agreements contain invention assignment clauses?
5. Are there any third-party IP licenses that may not be transferable?
6. Are open-source software obligations disclosed?
7. Is there a complete chain of title from creator to current owner?
8. Are there any IP-related representations and warranties?
Flag any gaps in the IP ownership chain as HIGH or CRITICAL risk.
Flag any licenses that contain anti-assignment or change-of-control provisions.
temperature: 0.0
max_tokens: 3000
risk_report_executive_summary:
version: '1.0'
system: |
You are drafting the executive summary section of a legal due diligence risk report.
RULES:
1. Write in professional, objective legal tone
2. Lead with the most significant findings (Critical and High severity)
3. Quantify: number of documents reviewed, number of findings by severity
4. Identify the top 3-5 risks that could affect deal terms or valuation
5. Note any areas where further investigation is recommended
6. Do NOT provide legal advice or opinions on whether to proceed
7. Do NOT use marketing language or superlatives
8. Include a statement that this is AI-generated and requires attorney review
9. Keep to 3-4 concise paragraphs
10. Use specific document references, not vague generalizations
temperature: 0.3
max_tokens: 2000
Testing & Validation
- DOCUMENT INGESTION TEST: Upload a package of 20 mixed-format documents (10 native PDFs, 5 scanned PDFs, 3 Word documents, 2 image files) to iManage test matter folder. Trigger ingestion pipeline. Verify all 20 documents are processed, OCR produces readable text for scanned documents, and all chunks are indexed in Pinecone namespace 'matter-TEST-DD-001'. Expected: 100% document ingestion success rate, OCR accuracy >95% on scanned documents.
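One way to operationalize the >95% OCR accuracy criterion is a character-level similarity check against a known ground-truth transcript of each scanned test page. The sketch below uses Python's standard-library `difflib` as a simple proxy metric; the sample strings and the threshold wiring are illustrative assumptions, not part of any vendor API.

```python
import difflib

def ocr_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Character-level similarity between OCR output and a known transcript."""
    return difflib.SequenceMatcher(None, ocr_text, ground_truth).ratio()

# Hypothetical sample from a scanned test document (one planted OCR error: "1aws").
truth = "This Agreement shall be governed by the laws of the State of Delaware."
ocr = "This Agreement shall be governed by the 1aws of the State of Delaware."

score = ocr_accuracy(ocr, truth)
assert score > 0.95, f"OCR accuracy {score:.2%} below 95% threshold"
```

In practice you would maintain a small set of ground-truth transcripts for the scanned test documents and fail the ingestion test if any page falls below the threshold.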
- VECTOR SEARCH ACCURACY TEST: After ingesting test documents, perform 10 semantic searches for known clause types (e.g., 'change of control provision', 'indemnification cap'). Verify that the top-3 results for each query contain the correct document chunks. Expected: At least 8 out of 10 queries return the correct document in the top-3 results.
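The top-3 acceptance criterion above reduces to a simple recall computation over the ten probe queries. A minimal sketch, assuming search results have already been fetched and ranked (the document IDs and query strings below are hypothetical):

```python
def top3_recall(results_by_query: dict[str, list[str]], expected: dict[str, str]) -> float:
    """Fraction of queries whose expected document appears in the top-3 results."""
    hits = sum(1 for q, docs in results_by_query.items() if expected[q] in docs[:3])
    return hits / len(results_by_query)

# Hypothetical ranked search results keyed by query.
results = {
    "change of control provision": ["msa.pdf", "spa.pdf", "nda.pdf"],
    "indemnification cap": ["lease.pdf", "msa.pdf", "employment.pdf"],
    "governing law": ["nda.pdf", "lease.pdf", "spa.pdf"],
}
expected = {
    "change of control provision": "spa.pdf",
    "indemnification cap": "msa.pdf",
    "governing law": "employment.pdf",  # planted miss for illustration
}

recall = top3_recall(results, expected)  # 2/3 here; acceptance threshold is 8/10
```

Running this against the full ten-query probe set and asserting `recall >= 0.8` gives a repeatable pass/fail gate for the vector search test.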
- AGENT ANALYSIS COMPLETENESS TEST: Run the full DD agent on a test package containing documents with known planted issues: (1) a contract with a change-of-control termination trigger, (2) a missing governing law clause in one agreement, (3) contradictory indemnification caps across two contracts, (4) an IP assignment with only future-tense language. Verify the agent identifies all 4 issues. Expected: 100% detection of planted issues.
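The planted-issue test is easiest to score if each planted issue has a canonical label that agent findings are normalized to before comparison. A sketch under that assumption (the labels and normalization step are illustrative, not a defined agent output format):

```python
PLANTED_ISSUES = {
    "change-of-control termination trigger",
    "missing governing law clause",
    "contradictory indemnification caps",
    "future-tense-only IP assignment",
}

def detection_rate(agent_findings: set[str]) -> float:
    """Fraction of planted issues the agent surfaced (order-insensitive)."""
    return len(PLANTED_ISSUES & agent_findings) / len(PLANTED_ISSUES)

# Hypothetical agent output after normalizing finding labels.
found = {
    "change-of-control termination trigger",
    "missing governing law clause",
    "contradictory indemnification caps",
    "future-tense-only IP assignment",
}
assert detection_rate(found) == 1.0  # acceptance: 100% of planted issues
```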
- HUMAN-IN-THE-LOOP CHECKPOINT TEST: Trigger a DD review that generates at least one CRITICAL finding. Verify: (1) Teams notification is sent to the Deal Room channel within 2 minutes, (2) the review portal URL in the notification is accessible, (3) the portal displays the flagged finding with approve/modify/reject options, (4) approving the finding updates the state correctly, (5) the final report cannot be generated until all critical findings are reviewed. Expected: All 5 verification points pass.
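Verification point (5), that the final report cannot be generated until all critical findings are reviewed, is the core gating rule of the human-in-the-loop checkpoint. A minimal sketch of that rule (the class names, field names, and state strings are assumptions for illustration, not the actual agent's data model):

```python
from dataclasses import dataclass, field

REVIEWED_STATES = {"approved", "modified", "rejected"}

@dataclass
class Finding:
    finding_id: str
    severity: str              # "critical", "high", "medium", "low", "informational"
    review_state: str = "pending"

@dataclass
class Matter:
    findings: list = field(default_factory=list)

    def can_generate_report(self) -> bool:
        """Report generation is blocked until every critical finding is reviewed."""
        return all(f.review_state in REVIEWED_STATES
                   for f in self.findings if f.severity == "critical")

m = Matter([Finding("F-1", "critical"), Finding("F-2", "high")])
assert not m.can_generate_report()     # critical finding still pending
m.findings[0].review_state = "approved"
assert m.can_generate_report()         # high-severity findings do not block
```

The checkpoint test then amounts to asserting `can_generate_report()` is false before attorney action and true after each approve/modify/reject transition.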
- RISK REPORT GENERATION TEST: After completing a test analysis, generate both PDF and JSON reports. Verify: (1) PDF is properly formatted with all sections (executive summary, statistics, findings, disclaimer), (2) JSON schema validates correctly, (3) findings are sorted by severity, (4) all finding IDs are unique, (5) confidence scores are between 0 and 1, (6) the AI disclaimer appears on the first page of the PDF. Expected: Both reports generate without errors and pass all verification points.
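Verification points (2) through (5) for the JSON report can be automated with a small validator. The field names and severity vocabulary below are assumptions about the report schema; adapt them to whatever schema the agent actually emits.

```python
def validate_report(report: dict) -> list:
    """Return a list of validation errors for a generated DD report (empty = pass)."""
    errors = []
    order = {"critical": 0, "high": 1, "medium": 2, "low": 3, "informational": 4}
    findings = report.get("findings", [])
    ids = [f["id"] for f in findings]
    if len(ids) != len(set(ids)):
        errors.append("duplicate finding IDs")
    ranks = [order[f["severity"]] for f in findings]
    if ranks != sorted(ranks):
        errors.append("findings not sorted by severity")
    if not all(0.0 <= f["confidence"] <= 1.0 for f in findings):
        errors.append("confidence score out of range")
    if not report.get("disclaimer"):
        errors.append("missing AI disclaimer")
    return errors

report = {
    "disclaimer": "AI-generated; requires attorney review.",
    "findings": [
        {"id": "F-1", "severity": "critical", "confidence": 0.91},
        {"id": "F-2", "severity": "medium", "confidence": 0.64},
    ],
}
assert validate_report(report) == []
```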
- CLIO TIME ENTRY TEST: After completing a DD review, verify: (1) a time entry is created in Clio for the reviewing attorney with the correct matter ID, (2) the time entry description includes '[AI-Assisted]' prefix, (3) a matter note is posted with the DD summary statistics, (4) the note contains the executive summary text. Expected: All Clio entries are created and visible in the attorney's Clio dashboard.
- IMANAGE WRITE-BACK TEST: After report generation, verify: (1) the PDF report is saved to the correct iManage matter folder, (2) the document is classified with the correct document class, (3) the document metadata includes the generation timestamp and agent version, (4) the document is accessible to attorneys with matter access. Expected: Report appears in iManage within 30 seconds of generation.
- SECURITY AND ACCESS CONTROL TEST: Attempt to access the review portal without authentication (should return 401). Attempt to access a matter belonging to a different practice group (should return 403). Verify that API keys are not exposed in application logs. Verify that TLS 1.2+ is enforced on all endpoints. Check that Azure OpenAI audit logs capture all API calls. Expected: All security controls function correctly.
- PERFORMANCE AND SCALABILITY TEST: Process a large DD package of 100 documents (approximately 2,000 pages total). Measure: (1) total ingestion time, (2) total agent analysis time, (3) report generation time. Expected: Ingestion completes within 30 minutes, analysis completes within 60 minutes, report generation completes within 5 minutes. Total end-to-end under 2 hours for 100 documents.
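The three stage budgets (30, 60, and 5 minutes) can be enforced with a simple timing harness around each pipeline stage. A sketch using only the standard library; the stage names and budget table are taken from the test description above, while the harness itself is an illustrative assumption:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock duration of a pipeline stage, in seconds."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[stage] = time.monotonic() - start

# Budgets in seconds, per the performance test criteria.
BUDGETS = {"ingestion": 30 * 60, "analysis": 60 * 60, "report": 5 * 60}

with timed("ingestion"):
    pass  # run the real ingestion pipeline here

over_budget = [s for s, secs in timings.items() if secs > BUDGETS[s]]
assert not over_budget, f"stages over budget: {over_budget}"
```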
- DATA ISOLATION TEST: Create two separate test matters (TEST-DD-001 and TEST-DD-002) with different document packages. Run DD analysis on both. Verify: (1) vector search for matter 001 does not return results from matter 002, (2) the review portal for matter 001 does not display findings from matter 002, (3) Clio entries are posted to the correct respective matters. Expected: Complete data isolation between matters.
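If every chunk is indexed under a per-matter namespace such as `matter-TEST-DD-001` (per the ingestion test above), the isolation check reduces to asserting that no returned result carries a foreign namespace. A sketch of that check on already-fetched results; the result-dict shape is a hypothetical, not the Pinecone response format:

```python
def assert_isolated(matter_id: str, results: list) -> None:
    """Fail if any search result leaks from a different matter's namespace."""
    leaked = [r for r in results if r.get("namespace") != f"matter-{matter_id}"]
    if leaked:
        raise AssertionError(f"cross-matter leakage: {leaked}")

# Hypothetical results from a query scoped to matter TEST-DD-001.
results_001 = [
    {"id": "chunk-17", "namespace": "matter-TEST-DD-001"},
    {"id": "chunk-42", "namespace": "matter-TEST-DD-001"},
]
assert_isolated("TEST-DD-001", results_001)  # passes: no leakage
```

The same assertion run against matter TEST-DD-002's queries completes the two-sided isolation test.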
- DISASTER RECOVERY TEST: Simulate failure scenarios: (1) Azure OpenAI rate limit exceeded — verify graceful retry with exponential backoff, (2) Pinecone timeout — verify partial results are preserved and agent can resume, (3) iManage connection failure — verify documents are cached locally and retried, (4) Mid-analysis application restart — verify state is persisted and analysis can resume from last checkpoint. Expected: All failure scenarios handled gracefully without data loss.
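Scenario (1), graceful retry on rate limits, is typically implemented as exponential backoff with jitter. A minimal generic sketch (the wrapper name and parameters are illustrative; real code would catch the provider's specific rate-limit exception rather than bare `Exception`):

```python
import random
import time

def with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a flaky call with exponential backoff plus jitter (sketch)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

# Simulated rate-limited call: fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"

assert with_backoff(flaky, base_delay=0.01) == "ok"
assert attempts["n"] == 3
```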
- END-TO-END USER ACCEPTANCE TEST: Have a participating attorney run a complete DD workflow on a real (but low-stakes) matter: upload documents to iManage, trigger DD review from the portal, receive Teams notification, review flagged findings, approve/modify findings, generate final report, verify Clio time entry. Collect attorney feedback on: report quality, finding accuracy, ease of use, time saved vs. manual review. Expected: Attorney confirms the system is usable and findings are directionally accurate, with specific feedback documented for iteration.
Client Handoff
The client handoff meeting should be scheduled as a 2-hour session with the managing partner, lead M&A attorney, IT administrator (if any), and all attorneys who will use the system. Cover the following topics:
Documentation to Leave Behind
- System Architecture Diagram (PDF)
- Attorney Quick-Start Guide (laminated desk reference)
- AI Usage Policy (firm-customized, signed by all attorneys)
- Updated Engagement Letter Template with AI Disclosure
- Vendor DPA Summary Sheet (what each vendor can/cannot do with data)
- Troubleshooting Runbook with screenshots
- MSP Support Contact Card with escalation tiers
- Spellbook Quick Reference Card
- Copilot Tips Sheet for Legal Professionals
- 90-Day Adoption Roadmap with milestones
Maintenance
ONGOING MSP MAINTENANCE RESPONSIBILITIES:
1. Weekly (30 min/week)
2. Monthly (2 hrs/month)
3. Quarterly (4 hrs/quarter)
4. Annually
SLA Considerations
- Response time for system outages: 1 hour during business hours, 4 hours after hours
- Response time for non-critical issues: 4 business hours
- Scheduled maintenance window: Sundays 2:00–6:00 AM local time
- Uptime target: 99.5% during business hours (Mon–Sat 7 AM–9 PM)
- Maximum acceptable DD processing time: 3 hours for packages under 200 documents
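For reference, the 99.5% business-hours uptime target translates to a concrete weekly downtime budget. With business hours of Mon–Sat, 7 AM–9 PM, the arithmetic is:

```python
hours_per_day = 21 - 7                 # 7 AM to 9 PM = 14 hours
days_per_week = 6                      # Mon-Sat
business_hours = hours_per_day * days_per_week      # 84 hours/week
allowed_downtime_min = business_hours * 60 * (1 - 0.995)
assert round(allowed_downtime_min, 1) == 25.2       # ~25 minutes/week
```

That is, a single outage longer than about 25 minutes during business hours consumes the entire weekly budget, which is worth stating explicitly when negotiating the SLA.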
Model Retraining / Prompt Update Triggers
- Attorney feedback indicates >20% false positive rate on risk findings
- New regulation or ABA opinion affects DD analysis requirements
- Firm begins handling a new transaction type (e.g., healthcare M&A requiring HIPAA analysis)
- Azure OpenAI releases a new model version with significant capability improvements
- Spellbook or Clio releases major feature updates affecting integration points
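The 20% false-positive trigger above is only actionable if attorney approve/reject decisions are tallied systematically. A minimal sketch of that computation, assuming findings carry a hypothetical `review` field recording the attorney's verdict:

```python
def false_positive_rate(findings: list) -> float:
    """Share of attorney-reviewed findings that were rejected as incorrect."""
    reviewed = [f for f in findings if f["review"] in {"confirmed", "rejected"}]
    if not reviewed:
        return 0.0
    return sum(f["review"] == "rejected" for f in reviewed) / len(reviewed)

FP_THRESHOLD = 0.20  # retraining / prompt-update trigger from the list above

# Hypothetical attorney feedback log.
feedback = [
    {"id": "F-1", "review": "confirmed"},
    {"id": "F-2", "review": "rejected"},
    {"id": "F-3", "review": "confirmed"},
    {"id": "F-4", "review": "confirmed"},
]
fp = false_positive_rate(feedback)   # 0.25 here, which would trip the trigger
assert fp > FP_THRESHOLD
```

Running this monthly over the feedback log gives the MSP an objective signal for when prompt engineering work is warranted.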
Escalation Path
- Tier 1 (MSP Help Desk): Password resets, Spellbook/Copilot basic issues, scanner problems
- Tier 2 (MSP Cloud Engineer): API integration failures, Azure service issues, Pinecone connectivity
- Tier 3 (MSP AI Specialist): Agent behavior issues, prompt engineering, report quality concerns
- Tier 4 (Vendor Support): Platform-specific bugs — contact Harvey/Spellbook/Clio/Microsoft directly
- Emergency: Managing partner contacts MSP account manager directly for deal-critical system failures
Alternatives
Harvey AI Agent Builder (Enterprise Turnkey)
Instead of building a custom DD agent on Azure OpenAI + LangChain, deploy Harvey AI's Agent Builder platform. Harvey provides a fully managed legal AI environment where attorneys can create custom DD workflows without code. Harvey's platform already has legal-specific training, built-in compliance features, and handles all LLM infrastructure. The MSP's role shifts from building the AI to managing the Harvey deployment, integration, and training.
Kira by Litera (ML-Based Extraction)
Replace the custom GPT-5.4 agent with Kira's established ML-based contract review platform. Kira uses purpose-trained machine learning models (not general LLMs) for clause extraction and has 18+ years of training data from top global law firms. The new Kira experience includes generative AI capabilities at no additional cost. Integrates natively with Litera's document management and transaction management tools.
Microsoft Copilot Studio Custom Agents (Low-Code)
Instead of the Python-based custom agent, build the DD orchestration workflow using Microsoft Copilot Studio's visual agent builder. This creates agents that run within Microsoft Teams and can be triggered by attorneys directly from their collaboration environment. Uses Azure OpenAI under the hood but with a low-code configuration interface. Agents can call external APIs (iManage, Clio, Pinecone) via custom connectors.
Recommend this for firms wanting a lighter DD capability (e.g., contract review for non-M&A transactions) or as a Phase 1 proof of concept before investing in the full custom agent.
Self-Hosted Open Source LLM (On-Premises)
For firms with extreme confidentiality requirements (e.g., national security matters, pre-announcement M&A for public companies), deploy an open-source LLM (DeepSeek-R1 or Qwen3-235B) on on-premises GPU servers. No data ever leaves the firm's network. The MSP procures, installs, and maintains the GPU infrastructure and manages model updates.
Recommend this ONLY for firms with documented regulatory or client requirements prohibiting cloud AI, and only after confirming the firm's facilities can support the power and cooling requirements.
Emma Legal + VDR Integration (M&A Specialist)
For firms focused specifically on M&A due diligence (rather than general contract review), deploy Emma Legal as the primary AI platform integrated directly with Intralinks or Datasite virtual data rooms. Emma is purpose-built for M&A DD and provides pre-configured workflows for analyzing deal documents, flagging clause-level risks, and generating structured DD reports that can be shared with counterparties.
Recommend this for boutique M&A firms that do high-volume deal work and want the fastest, most focused solution with lower cost of entry.