Extraction Models
Automatic data extraction from documents.
Overview
Extraction models allow you to automatically extract structured data from documents (PDF, images, emails) to fill form fields.
Technologies
- OCR: Optical character recognition
- Pattern matching: Regular expressions
- AI/ML: Machine learning (advanced versions)
- Zones: Extraction by coordinates
Extraction Types
Zone-based Extraction
Define rectangles on the document:
Zone 1: x=100, y=200, w=300, h=50 → Invoice number
Zone 2: x=100, y=300, w=300, h=50 → Date
Zone 3: x=100, y=400, w=300, h=50 → Amount
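To illustrate how a zone can select text, here is a minimal sketch that assumes the OCR step returns words with bounding boxes in the same coordinate system as the zones. The `Zone`, `OcrWord`, and `ZoneReader` names are hypothetical and are not part of the product's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; not the product's API.
public record Zone(string Label, int X, int Y, int Width, int Height);
public record OcrWord(string Text, int X, int Y, int Width, int Height);

public static class ZoneReader
{
    // Returns the text of all words whose center point lies inside the zone.
    // Words are ordered left to right, which assumes a single-line zone.
    public static string ReadZone(Zone zone, IEnumerable<OcrWord> words)
    {
        var inside = words
            .Where(w =>
            {
                double centerX = w.X + w.Width / 2.0;
                double centerY = w.Y + w.Height / 2.0;
                return centerX >= zone.X && centerX <= zone.X + zone.Width
                    && centerY >= zone.Y && centerY <= zone.Y + zone.Height;
            })
            .OrderBy(w => w.X)
            .Select(w => w.Text);

        return string.Join(" ", inside);
    }
}
```

Testing the word's center rather than its full box tolerates words that slightly overlap the zone border, which is one possible design choice among several.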
Pattern-based Extraction
Use regular expressions:
# Invoice number
Facture\s*N°\s*:\s*([A-Z0-9-]+)
# Date
(\d{2})/(\d{2})/(\d{4})
# Amount
(\d+[,.]\d{2})\s*€
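The patterns above can be tried against OCR text with the standard .NET `Regex` class. This is a minimal sketch: the `PatternExtractor` class is hypothetical, and it assumes dates are written day first (`dd/MM/yyyy`) and that amounts carry no thousands separator.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not the product's API.
public static class PatternExtractor
{
    static readonly Regex InvoiceNumberPattern = new Regex(@"Facture\s*N°\s*:\s*([A-Z0-9-]+)");
    static readonly Regex DatePattern = new Regex(@"(\d{2})/(\d{2})/(\d{4})");
    static readonly Regex AmountPattern = new Regex(@"(\d+[,.]\d{2})\s*€");

    // Returns the captured invoice number, or null when the pattern does not match.
    public static string? ExtractInvoiceNumber(string text)
    {
        var match = InvoiceNumberPattern.Match(text);
        return match.Success ? match.Groups[1].Value : null;
    }

    // Returns the first date found, assuming day/month/year order.
    public static DateTime? ExtractDate(string text)
    {
        var match = DatePattern.Match(text);
        if (!match.Success) return null;
        return DateTime.TryParseExact(match.Value, "dd/MM/yyyy",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out var date)
            ? date : (DateTime?)null;
    }

    // Returns the first amount followed by a euro sign; accepts ',' or '.' as decimal separator.
    public static decimal? ExtractAmount(string text)
    {
        var match = AmountPattern.Match(text);
        if (!match.Success) return null;
        string normalized = match.Groups[1].Value.Replace(',', '.');
        return decimal.TryParse(normalized, NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture, out var amount)
            ? amount : (decimal?)null;
    }
}
```

Returning `null` instead of throwing lets the caller decide whether to fall back to manual entry when a pattern finds nothing.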
Keyword-based Extraction
Search for values after keywords:
Keyword: "Total HT :"
Value: Number following the keyword
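One way to read the number that follows a keyword is sketched below. The `KeywordExtractor` class is hypothetical, and the sketch assumes the value comes directly after the keyword (optional whitespace only) and has no thousands separator.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not the product's API.
public static class KeywordExtractor
{
    // Finds the keyword (case-insensitive) and parses the number right after it.
    // Returns null when the keyword is absent or no number follows it.
    public static decimal? ValueAfter(string text, string keyword)
    {
        int index = text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
        if (index < 0) return null;

        string rest = text.Substring(index + keyword.Length);
        var match = Regex.Match(rest, @"^\s*(\d+(?:[,.]\d+)?)");
        if (!match.Success) return null;

        // Normalize a decimal comma to a decimal point before parsing.
        string normalized = match.Groups[1].Value.Replace(',', '.');
        return decimal.TryParse(normalized, NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture, out var value)
            ? value : (decimal?)null;
    }
}
```

For example, `KeywordExtractor.ValueAfter("Total HT : 1000,00 €", "Total HT :")` yields `1000.00`.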
Configuration
Create an Extraction Model
- Open Extraction in Process Studio
- Create a new model
- Load a sample document
- Define the extraction zones
- Associate each zone with a form field
- Test and validate the model
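Define a Zone
var extractionZone = new ExtractionZone
{
    Name = "Numero_Facture",
    // Zone rectangle on the document
    X = 100,
    Y = 200,
    Width = 300,
    Height = 50,
    // Form field associated with this zone
    FieldName = "NumeroFacture",
    // Regular expression for the expected value
    Pattern = @"[A-Z0-9-]+"
};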
Usage
Manual Extraction
- Open a document
- Select Extract data
- Choose the extraction model
- Fields are filled automatically
- Verify and correct if necessary
Automatic Extraction
When importing a document:
// Automatic document type detection
var documentType = DetectDocumentType(uploadedFile);
// Apply extraction model
var extractedData = ApplyExtractionModel(documentType, uploadedFile);
// Fill form fields
FillFormFields(extractedData);
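The three calls above are not defined on this page. As a rough sketch of what the first two steps could look like, the example below works on already recognized text rather than on the uploaded file, detects the type with a naive keyword check, and applies one pattern per field. The `ImportPipeline` class, its model table, and the `Montant` field name are all assumptions made for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch; the product's real functions take the uploaded file.
public static class ImportPipeline
{
    // Assumed model table: document type -> (form field name -> pattern).
    static readonly Dictionary<string, Dictionary<string, string>> Models =
        new Dictionary<string, Dictionary<string, string>>
        {
            ["Invoice"] = new Dictionary<string, string>
            {
                ["NumeroFacture"] = @"Facture\s*N°\s*:\s*([A-Z0-9-]+)",
                ["Montant"] = @"(\d+[,.]\d{2})\s*€",
            },
        };

    // Naive detection: treats any text containing "Facture" as an invoice.
    public static string? DetectDocumentType(string ocrText) =>
        ocrText.IndexOf("Facture", StringComparison.OrdinalIgnoreCase) >= 0
            ? "Invoice"
            : null;

    // Applies every pattern of the model and keeps the fields that matched.
    public static Dictionary<string, string> ApplyExtractionModel(string documentType, string ocrText)
    {
        var result = new Dictionary<string, string>();
        if (!Models.TryGetValue(documentType, out var model)) return result;

        foreach (var entry in model)
        {
            var match = Regex.Match(ocrText, entry.Value);
            if (match.Success) result[entry.Key] = match.Groups[1].Value;
        }
        return result;
    }
}
```

Fields that do not match are simply absent from the result, so the form can leave them empty for manual entry.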
Use Cases
Invoices
Automatically extract:
- Invoice number
- Issue date
- Supplier
- Amount excluding tax, VAT, total including tax
- Invoice lines (table)
Business Cards
Extract:
- Last name, First name
- Company
- Phone
- Address
Contracts
Extract:
- Contracting parties
- Start/end dates
- Contract amount
- Specific clauses
Best Practices
- Standardized documents: They give the best results
- OCR quality: Use clear, high-contrast documents
- Validation: Always verify extracted data
- Learning: Improve models over time
- Fallback: Allow manual entry when extraction fails
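Part of the validation step can be automated. For invoices, one possible check is that the amount excluding tax plus VAT equals the total including tax. The sketch below is an assumption about how such a check might be written; `InvoiceValidator` and the one-cent default tolerance are hypothetical.

```csharp
using System;

// Hypothetical helper for illustration; not the product's API.
public static class InvoiceValidator
{
    // Returns true when excl. tax + VAT matches the total incl. tax
    // within the given tolerance (rounding differences on the document).
    public static bool TotalsAreConsistent(
        decimal amountExclTax, decimal vat, decimal totalInclTax, decimal tolerance = 0.01m)
    {
        return Math.Abs(amountExclTax + vat - totalInclTax) <= tolerance;
    }
}
```

When the check fails, the extracted values can be flagged for manual review instead of being accepted silently.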