Voseno
Engineering

How We Deduplicate Property Listings Across 1,400 Sources

The same Dutch property appears on Funda, Pararius, broker sites, and aggregators simultaneously. Here is our three-signal approach to deduplication: BAG ID resolution, fuzzy address normalization, and feature fingerprinting.

When a Dutch broker lists a property, it typically appears on multiple portals within hours. A rental apartment in Amsterdam might show up on Funda, Pararius, the broker's own website, and two or three aggregator sites, each presenting the data slightly differently. For our Aggregator API, which monitors over 1,400 sources, detecting that these are all the same property is the core technical challenge.

Why simple matching fails

The naive approach is exact string matching on the address. This breaks immediately in practice. One source lists 'Keizersgracht 100-3' while another shows 'Keizersgracht 100 III' and a third uses 'Keizersgracht 100 3 hoog.' These are all the same apartment. Dutch address conventions include numeric additions, letter additions, and descriptive additions (souterrain, boven, 3-hoog), and there is no standard format across portals.

Prices also vary across sources. One portal might show the monthly rent including service charges, another excluding them. Surface areas get rounded differently. Even property types are inconsistent: one source calls it an 'appartement,' another a 'bovenwoning.'

Signal 1: BAG ID resolution

The strongest deduplication signal is the BAG verblijfsobject ID. Every registered Dutch address maps to a unique 16-digit identifier in the Basisregistratie Adressen en Gebouwen (BAG), the national register of addresses and buildings. When we can resolve a listing's address to its BAG ID via the PDOK Locatieserver, we have a definitive match: two listings with the same BAG ID and the same listing type (sale or rental) are the same property.

This works for roughly 70% of incoming listings. It fails when the listing address is incomplete (just a street name and city, no house number), when the address includes a non-standard addition that PDOK cannot resolve, or when the property is too new to be in BAG yet.
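As a minimal sketch of how this signal can be applied, the snippet below groups listings by the pair (BAG ID, listing type) and sets aside anything that could not be resolved. The `Listing` class, its field names, and `group_by_bag_id` are hypothetical names chosen for illustration, and the BAG ID is assumed to have been resolved already (for example via the PDOK Locatieserver) and stored on the listing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Listing:
    source: str
    address: str
    listing_type: str             # "sale" or "rental"
    bag_id: Optional[str] = None  # None when BAG resolution failed


def group_by_bag_id(listings):
    """Split listings into BAG-keyed duplicate groups and an unresolved remainder."""
    groups = defaultdict(list)
    unresolved = []
    for listing in listings:
        if listing.bag_id is None:
            # No BAG ID: leave this listing for the fuzzy signals.
            unresolved.append(listing)
        else:
            # Same BAG ID and same listing type means the same property.
            groups[(listing.bag_id, listing.listing_type)].append(listing)
    return dict(groups), unresolved


# The BAG ID below is a made-up placeholder, not a real identifier.
listings = [
    Listing("funda", "Keizersgracht 100-3", "rental", "0363010000000001"),
    Listing("pararius", "Keizersgracht 100 III", "rental", "0363010000000001"),
    Listing("broker-site", "Keizersgracht 100-3", "sale", "0363010000000001"),
    Listing("aggregator", "Keizersgracht, Amsterdam", "rental"),
]
groups, unresolved = group_by_bag_id(listings)
```

Keying on the listing type as well as the BAG ID keeps a sale listing and a rental listing for the same unit apart, which matches the rule stated above.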

Signal 2: fuzzy address normalization

For listings where BAG ID resolution fails, we normalize the address using Dutch-specific rules and then compare using edit distance.

python
import re
import unicodedata
from Levenshtein import distance

# Street-type abbreviations usually appear as suffixes ("Kerkstr."), so the
# patterns match at the end of a word and consume the optional trailing dot.
ABBREVIATIONS = {
    r"str\.?(?![a-z])": "straat",
    r"ln\.?(?![a-z])": "laan",
    r"pl\.?(?![a-z])": "plein",
    r"gr\.?(?![a-z])": "gracht",
    r"wg\.?(?![a-z])": "weg",
}

def normalize_dutch_address(raw: str) -> str:
    """Normalize a Dutch address for comparison."""
    addr = raw.lower().strip()
    # Strip diacritics first so the [a-z] patterns below see plain ASCII
    addr = unicodedata.normalize("NFKD", addr)
    addr = addr.encode("ascii", "ignore").decode("ascii")
    # Expand common abbreviations
    for pattern, replacement in ABBREVIATIONS.items():
        addr = re.sub(pattern, replacement, addr)
    # Separate house number additions: "100-a" -> "100 a", "100-3" -> "100 3"
    addr = re.sub(r"(\d)\s*-\s*(?=[a-z0-9])", r"\1 ", addr)
    # Normalize Roman numerals in additions
    addr = re.sub(r"\biii\b", "3", addr)
    addr = re.sub(r"\bii\b", "2", addr)
    addr = re.sub(r"\bi\b", "1", addr)
    # Normalize hoog/boven floor indicators
    addr = re.sub(r"\bhoog\b", "h", addr)
    addr = re.sub(r"\bboven\b", "h", addr)
    return re.sub(r"\s+", " ", addr).strip()

def addresses_match(a: str, b: str, threshold: int = 2) -> bool:
    """Check if two addresses refer to the same location."""
    return distance(
        normalize_dutch_address(a),
        normalize_dutch_address(b)
    ) <= threshold

The normalization handles the most common Dutch address variations: abbreviations (str. to straat, ln. to laan), house number addition separators (100-A vs 100 A), Roman numerals (III to 3), and descriptive floor indicators (3-hoog, boven). After normalization, we apply Levenshtein distance with a threshold of 2 edits.
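One property of a whole-string threshold is worth spelling out: 'keizersgracht 100' and 'keizersgracht 102' are a single edit apart, so neighbouring house numbers fall inside a threshold of 2. The sketch below shows one possible guard, which requires every number in the two normalized addresses to agree exactly before the edit distance is consulted. It is an illustration rather than a description of the pipeline above. `edit_distance` and `strict_match` are hypothetical names, and the distance is reimplemented with the standard library only so that the block runs on its own.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def strict_match(norm_a: str, norm_b: str, threshold: int = 2) -> bool:
    """Compare two already-normalized addresses; all numbers must agree exactly."""
    if re.findall(r"\d+", norm_a) != re.findall(r"\d+", norm_b):
        return False
    return edit_distance(norm_a, norm_b) <= threshold
```

With this guard, 'keizersgracht 100 3' still matches 'keizersgracht 100 3 h' (same numbers, two edits), while 'keizersgracht 100' and 'keizersgracht 102' no longer match. Letter additions such as '100 a' and '100 b' are not covered by this check and would need their own rule.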

Signal 3: feature fingerprinting

When address matching is ambiguous, particularly in multi-unit buildings where listings sometimes omit the specific unit number, we use a feature fingerprint: the combination of surface area (within a 3 m² tolerance), price bucket, and room count. In dense Dutch housing markets, this tuple is nearly unique per listing within a building.

For example, if two listings both say 'Prinsengracht 200, Amsterdam' without a unit number, but one is 65 m², 3 rooms, €1,850/month and the other is 42 m², 2 rooms, €1,450/month, they are clearly different apartments in the same building. If the features match within tolerance, they are almost certainly the same listing from different sources.
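The comparison in this example can be sketched as follows. The 3 m² tolerance comes from the text; the €100 bucket width, the `Features` class, and the function names are assumptions made for illustration, since the bucket size is not stated here.

```python
from dataclasses import dataclass

AREA_TOLERANCE_M2 = 3   # tolerance stated in the text
PRICE_BUCKET_EUR = 100  # assumed bucket width, not stated in the text


@dataclass(frozen=True)
class Features:
    area_m2: float
    rooms: int
    price_eur: float


def price_bucket(price_eur: float) -> int:
    """Map a price to the index of its fixed-width bucket."""
    return int(price_eur // PRICE_BUCKET_EUR)


def fingerprints_match(a: Features, b: Features) -> bool:
    """True when area is within tolerance and rooms and price bucket are equal."""
    return (
        abs(a.area_m2 - b.area_m2) <= AREA_TOLERANCE_M2
        and a.rooms == b.rooms
        and price_bucket(a.price_eur) == price_bucket(b.price_eur)
    )


large = Features(area_m2=65, rooms=3, price_eur=1850)
small = Features(area_m2=42, rooms=2, price_eur=1450)
large_elsewhere = Features(area_m2=66, rooms=3, price_eur=1825)

different_units = fingerprints_match(large, small)       # False
same_unit = fingerprints_match(large, large_elsewhere)   # True
```

Fixed-width buckets have an edge effect: €1,795 and €1,805 land in different buckets even though they differ by only €10. Comparing prices with a tolerance, as is done for surface area, is one way to avoid that.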

Hard cases

Several patterns remain genuinely difficult. Listings from a VvE (Vereniging van Eigenaars, the owners' association of a building) sometimes use the building address rather than the individual unit address. Newly built properties may not yet appear in BAG. Temporary listings that appear briefly on one source and then get relisted with different details on another can evade time-windowed matching.

Multi-unit new construction is particularly tricky. A developer might list '12 apartments at Nieuwbouwproject De Werf' on one portal, while individual units appear on broker sites with provisional addresses. Until BAG registration catches up, these require manual review.

Our policy: when match confidence is below 0.70, we treat the listings as distinct. We would rather show an occasional duplicate than incorrectly merge two separate properties.
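Expressed as code, the policy is a single threshold check. This sketch assumes a confidence score between 0 and 1 has already been computed from the signals; how that score is derived is not covered here, and `should_merge` is a hypothetical name.

```python
MERGE_THRESHOLD = 0.70  # threshold stated in the policy


def should_merge(confidence: float) -> bool:
    """Merge only at or above the threshold; below it, keep the listings distinct."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    return confidence >= MERGE_THRESHOLD
```

Treating a score of exactly 0.70 as a merge is an assumption; the policy only states what happens below the threshold.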

Measuring accuracy

We maintain a hand-labelled ground-truth dataset of 12,000 listing pairs (6,000 confirmed duplicates and 6,000 confirmed distinct properties), assembled over six months of manual review. Against this dataset, our pipeline achieves 99.1% precision and 98.8% recall. We continuously expand this dataset as we encounter new edge cases.
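For reference, precision is the share of predicted duplicates that really are duplicates, and recall is the share of real duplicates that the pipeline finds. The sketch below computes both from labelled pairs; the function name and the toy data are illustrative and are not drawn from the ground-truth dataset.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of booleans.

    True means "this pair is a duplicate"; predicted is the pipeline's
    verdict, actual is the hand-assigned label.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy example: 3 true positives, 1 false positive, 2 false negatives.
predicted = [True, True, True, True, False, False]
actual = [True, True, True, False, True, True]
precision, recall = precision_recall(predicted, actual)  # 0.75 and 0.6
```

A policy that keeps listings distinct when confidence is low trades some recall for precision, because a missed duplicate counts against recall while a wrong merge counts against precision.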