Nous chargeons les différents jeux de données spatiaux provenant de sources hétérogènes (ANGAP, SAPM, WDPA, Vahatra) ainsi que les données légales extraites manuellement. Ils seront harmonisés et fusionnés dans un jeu unique de référence.
Le champ CLASSEMENT contient des références légales souvent partielles ou hétérogènes. Une extraction manuelle a permis d’en dériver les champs date_texte, type_texte, num_texte (catégories existantes dans les données CNLEGIS), et une année de statut STATUS_YR. Ces champs sont intégrés au jeu de données harmonisé.
5.2 Harmonisation des autres sources (SAPM, Vahatra)
Les autres jeux de données sont également harmonisés avec un sous-ensemble de variables cibles. On applique des transformations simples et un renommage standardisé pour assurer l’interopérabilité.
On utilise une fonction pa_match() pour identifier les aires protégées correspondantes dans la WDPA, sur la base de la proximité spatiale et du taux de recouvrement géométrique.
Code
# Match protected areas from x to y based on centroid distance and spatial overlap.# Returns the original x with best matches from y, using overlap thresholds and ranking.# Adds matched name, ID, and overlap percentage to x.pa_match <-function(x, y,pa_name_x ="NAME",pa_name_y ="NAME",pa_id_y ="WDPAID",threshold_strong_match =0.35,threshold_weak_match =0.1) {# Harmonize CRS and fix geometry y <-st_transform(y, st_crs(x)) %>%st_make_valid()# Prepare x and y with centroids x_aug <- x %>%mutate(index_x =row_number(),name_x = .data[[pa_name_x]],centroid_x =st_centroid(geometry)) y_aug <- y %>%mutate(index_y =row_number(),name_y = .data[[pa_name_y]],id_y = .data[[pa_id_y]],centroid_y =st_centroid(geometry))# Find 3 closest y for each x x_matches <- x_aug %>%rowwise() %>%mutate(closest_y =list( y_aug %>%st_drop_geometry() %>%mutate(dist =as.numeric(st_distance(centroid_x, centroid_y))) %>%arrange(dist) %>%slice_head(n =2) %>%select(index_y, dist) ) ) %>%unnest(closest_y)# Compute spatial overlaps matches_with_overlap <- x_matches %>%rowwise() %>%mutate(overlap =tryCatch({ inter <-st_intersection(geometry, y_aug$geometry[index_y])if (length(inter) >0) {round(as.numeric(st_area(inter) /st_area(geometry)), 3) } else {0 } }, error =function(e) 0) ) %>%ungroup() %>%mutate(name_y = y_aug[[pa_name_y]][index_y],id_y = y_aug[[pa_id_y]][index_y] ) %>%st_drop_geometry()# Rank and apply selection logic best_matches <- matches_with_overlap %>%group_by(index_x) %>%arrange(desc(overlap), dist) %>%mutate(rank =row_number()) %>%slice(1:2) %>%summarise(index_x =first(index_x),name_y =case_when(n() >=2&& overlap[2] > threshold_strong_match && overlap[2] > overlap[1] ~ name_y[2], overlap[1] < threshold_weak_match ~NA_character_,TRUE~ name_y[1] ),id_y =case_when(n() >=2&& overlap[2] > threshold_strong_match && overlap[2] > overlap[1] ~ id_y[2], overlap[1] < threshold_weak_match ~NA_integer_,TRUE~ id_y[1] ),overlap_y =case_when(n() >=2&& overlap[2] > threshold_strong_match && overlap[2] > overlap[1] ~ overlap[2], overlap[1] < threshold_weak_match ~NA_real_,TRUE~ overlap[1] ),.groups ="drop" )# Join results back to x output <- x_aug %>%left_join(best_matches, by ="index_x") %>%select(-index_x, -name_x, -centroid_x)return(output)}all_PAs_matched <-pa_match(all_PAs, wdpa_mdg_2025)all_PAs_unmatched <- all_PAs_matched %>%filter(is.na(name_y))words_to_exlcude <-c("tsingy de","corridor entre parcelles i et ii d'","corridor", "extension","for[eê]t d['’]?", "aire protégée( d['’]?)?","androka","Maromena")# Enhanced fallback matching function for protected areaspa_match_fallback_name <-function(x, y,pa_name_x ="NAME",pa_name_y ="NAME",pa_id_y ="WDPAID", remove_terms,min_stringdist =0.08) {# Clean and normalize names clean_name <-function(s) { pattern <-str_c(remove_terms, collapse ="|") s %>%str_to_lower() %>%str_replace_all(pattern, "") %>%str_squish() }# Create index for reinsertion x <- x %>%mutate(row_index =row_number())# Extract unmatched rows x_unmatched <- x %>%filter(is.na(name_y)) %>%mutate(name_x_raw = .data[[pa_name_x]],name_x_clean =clean_name(name_x_raw)) %>%distinct(name_x_clean, .keep_all =TRUE) %>%mutate(index_x =row_number())# Prepare y y_tbl <- y %>%st_drop_geometry() %>%mutate(name_y_raw = .data[[pa_name_y]],name_y_clean =clean_name(name_y_raw),id_y = .data[[pa_id_y]],index_y =row_number())# String distance matrix distance_matrix <-stringdistmatrix(x_unmatched$name_x_clean, y_tbl$name_y_clean,method ="jw", p =0.1)# Get best match per unmatched row closest_matches <-tibble(index_x =rep(x_unmatched$index_x, times =nrow(y_tbl)),index_y =rep(y_tbl$index_y, each =nrow(x_unmatched)),dist =as.vector(distance_matrix) ) %>%group_by(index_x) %>%slice_min(dist, n =1, with_ties =FALSE) %>%ungroup()# Join metadata matched <- closest_matches %>%left_join(x_unmatched, by ="index_x") %>%left_join(y_tbl, by ="index_y") %>%filter(dist < min_stringdist)# Compute overlap with matched geometry matched <- matched %>%rowwise() %>%mutate(overlap_y =tryCatch({ inter <-st_intersection(geometry, y$geometry[index_y])if (length(inter) >0) round(as.numeric(st_area(inter) /st_area(geometry)), 3) else0 }, error =function(e) NA_real_)) %>%ungroup()# Replace original rows x_updated <- x %>%left_join( matched %>%select(row_index, name_y = name_y_raw, id_y = id_y.y, overlap_y, stringdist = dist),by ="row_index" ) %>%mutate(name_y =coalesce(name_y.y, name_y.x),id_y =coalesce(id_y.y, id_y.x),overlap_y =coalesce(overlap_y.y, overlap_y.x),stringdist =if ("stringdist.x"%in%names(.)) coalesce(stringdist, stringdist.x) else stringdist ) %>%select(-row_index, -ends_with(".x"), -ends_with(".y"), -stringdist)return(x_updated)} # Perform spatial matchingall_PAs_matched <-pa_match_fallback_name(x = all_PAs_matched, y = wdpa_mdg_2025,remove_terms = words_to_exlcude) %>%rename(WDPA_NAME = name_y, WDPAID = id_y, overlap_WPDA = overlap_y)
5.4 Ajout des données légales issues des textes officiels
On ajoute les aires protégées mentionnées dans les décrets à travers le jeu legal contenant les données CNLEGIS. Ces données ne comportent pas de géométrie, aussi on en ajoute une vide afin de pouvoir les consolider dans la même table.
Code
# Format to be compatible with the spatial datalegal_conso <- legal %>%mutate(NAME = ap_nom_texte,dataset_id ="CNLEGIS_2024",date_texte =as.character(date_texte),geometry =st_sfc(rep(st_geometrycollection(), n()), crs =4326)) |>select(NAME, dataset_id, date_texte, type_texte, num_texte, type_decision, WDPA_NAME, WDPAID, geometry) %>%st_as_sf()all_PAs_matched <- all_PAs_matched %>%st_transform(st_crs(legal_conso)) %>%bind_rows(legal_conso)
5.5 Export final
On ajoute enfin les données WDPA de mars 2025. Le jeu consolidé est exporté dans des formats interopérables (.parquet et .rds).