
Extraction Models

Automatic data extraction from documents.

Overview​

Extraction models automatically extract structured data from documents (PDFs, images, emails) and use it to fill form fields.

Technologies​

  • OCR: Optical character recognition
  • Pattern matching: Regular expressions
  • AI/ML: Machine learning (advanced versions)
  • Zones: Extraction by coordinates

Extraction Types​

Zone-based Extraction​

Define rectangles on the document by position (x, y) and size (w, h), and map each one to a field:

Zone 1: x=100, y=200, w=300, h=50 → Invoice number
Zone 2: x=100, y=300, w=300, h=50 → Date
Zone 3: x=100, y=400, w=300, h=50 → Amount
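As a rough illustration of how a zone can be read, the sketch below assumes the OCR step returns each word with a bounding box in the same coordinate system as the zones. The tuple layout and the `ReadZone` helper are hypothetical and are not part of the product API. A word is assigned to a zone when the center of its box falls inside the rectangle.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical OCR output: each word with the top-left corner and size of its box.
var words = new List<(string Text, int X, int Y, int W, int H)>
{
    ("Invoice", 20, 210, 60, 20),
    ("INV-2024-001", 110, 210, 120, 20),
    ("15/03/2024", 110, 310, 90, 20),
};

// Returns the words whose center lies inside the zone, top to bottom then left to right.
static string ReadZone(
    IEnumerable<(string Text, int X, int Y, int W, int H)> ocrWords,
    int x, int y, int w, int h)
{
    var inside = ocrWords
        .Where(o =>
        {
            double cx = o.X + o.W / 2.0;
            double cy = o.Y + o.H / 2.0;
            return cx >= x && cx <= x + w && cy >= y && cy <= y + h;
        })
        .OrderBy(o => o.Y)
        .ThenBy(o => o.X)
        .Select(o => o.Text);
    return string.Join(" ", inside);
}

Console.WriteLine(ReadZone(words, 100, 200, 300, 50)); // INV-2024-001
Console.WriteLine(ReadZone(words, 100, 300, 300, 50)); // 15/03/2024
```

Testing the center of the box rather than the whole box tolerates words that slightly overlap the zone border, which is common when scans are shifted by a few pixels.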

Pattern-based Extraction​

Use regular expressions:

# Invoice number
Facture\s*N°\s*:\s*([A-Z0-9-]+)

# Date
(\d{2})/(\d{2})/(\d{4})

# Amount
(\d+[,.]\d{2})\s*€
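The patterns above can be tried out with the standard .NET `Regex` class. This is a minimal sketch on made-up sample text. The `FirstGroup` helper is hypothetical, and the conversion to an ISO date assumes the document uses the day/month/year order.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// Made-up OCR text of a French invoice.
string text = "Facture N° : FA-2024-0042\nDate : 15/03/2024\nTotal TTC : 1250,00 €";

// Returns the first capture group of the first match, or null when nothing matches.
static string? FirstGroup(string input, string pattern)
{
    Match m = Regex.Match(input, pattern);
    return m.Success ? m.Groups[1].Value : null;
}

string? invoiceNumber = FirstGroup(text, @"Facture\s*N°\s*:\s*([A-Z0-9-]+)");
string? amount = FirstGroup(text, @"(\d+[,.]\d{2})\s*€");

// The date pattern has three groups: day, month, year (assumed order).
Match date = Regex.Match(text, @"(\d{2})/(\d{2})/(\d{4})");
string? isoDate = date.Success
    ? $"{date.Groups[3].Value}-{date.Groups[2].Value}-{date.Groups[1].Value}"
    : null;

Console.WriteLine($"{invoiceNumber} | {isoDate} | {amount}");
// FA-2024-0042 | 2024-03-15 | 1250,00
```

Note that `[A-Z0-9-]+` only accepts uppercase letters, digits, and hyphens, so an invoice number containing lowercase letters would be cut at the first one.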

Keyword-based Extraction​

Search for values after keywords:

Keyword: "Total HT :"
Value: Number following the keyword
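One possible way to implement a keyword lookup is to escape the keyword, then capture the first number that follows it on the same line. The sketch below is an assumption about how this could work, not the product's implementation, and `ValueAfterKeyword` is a hypothetical helper.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Finds the keyword, then parses the first number that follows it on the same line.
// Returns null when the keyword or the number is missing.
static decimal? ValueAfterKeyword(string source, string keyword)
{
    // Regex.Escape makes the keyword match literally, whatever characters it contains.
    string pattern = Regex.Escape(keyword) + @"[ \t]*(\d+(?:[.,]\d+)?)";
    Match m = Regex.Match(source, pattern, RegexOptions.IgnoreCase);
    if (!m.Success) return null;

    // Normalize a decimal comma to a dot so the invariant culture can parse it.
    string normalized = m.Groups[1].Value.Replace(',', '.');
    return decimal.TryParse(normalized, NumberStyles.AllowDecimalPoint,
                            CultureInfo.InvariantCulture, out decimal value)
        ? (decimal?)value
        : null;
}

string invoiceText = "Total HT : 1041,67\nTVA 20 % : 208,33\nTotal TTC : 1250,00";
decimal? totalHt = ValueAfterKeyword(invoiceText, "Total HT :");
Console.WriteLine(totalHt?.ToString(CultureInfo.InvariantCulture)); // 1041.67
```

This sketch does not handle thousands separators: for "1 041,67" it would capture only the leading "1", so amounts in that format need an extra normalization step.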

Configuration​

Create an Extraction Model​

  1. Open Extraction in Process Studio
  2. Create a new model
  3. Load a sample document
  4. Define extraction zones
  5. Associate with form fields
  6. Test and validate

Define a Zone​

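// Zone that reads the invoice number from a fixed rectangle on the document
var extractionZone = new ExtractionZone
{
    Name = "Numero_Facture",
    // Position and size of the rectangle
    X = 100,
    Y = 200,
    Width = 300,
    Height = 50,
    // Form field that receives the extracted value
    FieldName = "NumeroFacture",
    // Regular expression applied to the text read in the zone
    Pattern = @"[A-Z0-9-]+"
};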

Usage​

Manual Extraction​

  1. Open a document
  2. Select Extract data
  3. Choose the extraction model
  4. Fields are filled automatically
  5. Verify and correct if necessary

Automatic Extraction​

When importing a document:

// Automatic document type detection
var documentType = DetectDocumentType(uploadedFile);

// Apply extraction model
var extractedData = ApplyExtractionModel(documentType, uploadedFile);

// Fill form fields
FillFormFields(extractedData);
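The three calls above are not defined in this page. To make the flow concrete, here is a self-contained sketch with toy stand-ins for each of them. Everything in it is an assumption: the functions work on OCR text rather than on the uploaded file, detection is a simple keyword test, a model is a dictionary of regular expressions, and filling the form is replaced by printing.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical models: document type -> (form field name -> regex with one capture group).
var models = new Dictionary<string, Dictionary<string, string>>
{
    ["Invoice"] = new()
    {
        ["NumeroFacture"] = @"Facture\s*N°\s*:\s*([A-Z0-9-]+)",
        ["Montant"] = @"(\d+[,.]\d{2})\s*€",
    },
};

// Toy detection: a keyword test stands in for real document classification.
static string? DetectDocumentType(string ocrText) =>
    ocrText.Contains("Facture", StringComparison.OrdinalIgnoreCase) ? "Invoice" : null;

// Runs every pattern of the model and keeps the fields that matched.
Dictionary<string, string> ApplyExtractionModel(string type, string ocrText)
{
    var result = new Dictionary<string, string>();
    if (!models.TryGetValue(type, out var patterns)) return result;
    foreach (var (field, pattern) in patterns)
    {
        Match m = Regex.Match(ocrText, pattern);
        if (m.Success) result[field] = m.Groups[1].Value;
    }
    return result;
}

// Stand-in for the form binding: prints instead of writing to real fields.
static void FillFormFields(Dictionary<string, string> data)
{
    foreach (var (field, value) in data)
        Console.WriteLine($"{field} = {value}");
}

string uploadedText = "Facture N° : FA-2024-0042\nTotal TTC : 1250,00 €";
string? documentType = DetectDocumentType(uploadedText);

// Unknown document type: extract nothing and leave the fields for manual entry.
var extractedData = documentType is null
    ? new Dictionary<string, string>()
    : ApplyExtractionModel(documentType, uploadedText);

FillFormFields(extractedData);
```

Returning an empty result instead of throwing when the type is unknown keeps the import working and leaves room for the manual-entry fallback.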

Use Cases​

Invoices​

Automatically extract:

  • Invoice number
  • Issue date
  • Supplier
  • Amount excluding tax, VAT, total including tax
  • Invoice lines (table)
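Because the three amounts are related, they offer a cheap consistency check on the extraction: the amount excluding tax plus VAT should equal the total including tax. The sketch below illustrates this under stated assumptions: amounts use a decimal comma with optional spaces as thousands separators, and a one-cent tolerance absorbs rounding. Both helper names are hypothetical.

```csharp
using System;
using System.Globalization;

// Parses an amount written with a decimal comma, e.g. "1 041,67".
static bool TryParseAmount(string raw, out decimal amount)
{
    // Drop regular, non-breaking, and narrow non-breaking spaces, then use a decimal point.
    string cleaned = raw.Replace(" ", "")
                        .Replace("\u00A0", "")
                        .Replace("\u202F", "")
                        .Replace(',', '.');
    return decimal.TryParse(cleaned, NumberStyles.AllowDecimalPoint,
                            CultureInfo.InvariantCulture, out amount);
}

// True when net + VAT equals the gross total, within a one-cent rounding tolerance.
static bool TotalsAreConsistent(string net, string vat, string gross)
{
    return TryParseAmount(net, out decimal n)
        && TryParseAmount(vat, out decimal v)
        && TryParseAmount(gross, out decimal g)
        && Math.Abs(n + v - g) <= 0.01m;
}

Console.WriteLine(TotalsAreConsistent("1 041,67", "208,33", "1 250,00")); // True
Console.WriteLine(TotalsAreConsistent("1 041,67", "208,33", "1 260,00")); // False
```

A failed check does not say which field is wrong, so it is best used to flag the invoice for manual review rather than to correct values automatically.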

Business Cards​

Extract:

  • Last name, First name
  • Company
  • Email
  • Phone
  • Address
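Email addresses and phone numbers lend themselves to pattern-based extraction. The sketch below uses deliberately simple patterns on a made-up card, and real-world formats vary more than these expressions allow. Names and companies have no fixed shape and are not covered here.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// Made-up OCR text of a business card.
string card = "Marie Dupont\nACME SARL\nmarie.dupont@example.com\nTél : +33 1 23 45 67 89";

// Simplified patterns, assumed for illustration only.
Match email = Regex.Match(card, @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}");
Match phone = Regex.Match(card, @"\+?\d[\d .-]{7,}\d");

string? emailValue = email.Success ? email.Value : null;

// Keep only the '+' sign and the digits so the number has a uniform shape.
string? phoneValue = phone.Success
    ? Regex.Replace(phone.Value, @"[^\d+]", "")
    : null;

Console.WriteLine(emailValue); // marie.dupont@example.com
Console.WriteLine(phoneValue); // +33123456789
```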

Contracts​

Extract:

  • Contracting parties
  • Start/end dates
  • Contract amount
  • Specific clauses

Best Practices​

  • Standardized documents: Models give the best results on documents with a consistent layout
  • OCR quality: Use clear, high-contrast documents
  • Validation: Always verify extracted data
  • Learning: Improve models over time
  • Fallback: Allow manual entry if extraction fails
