3  Georeferencing

This section provides details about the georeferencing process applied to household panel data collected by the Rural Observatory System (ROS) from 1995 to 2014. The ROS gathered data on various rural contexts in Madagascar. While the data quality was carefully monitored, its geolocation attributes were not fully developed, limiting its use in spatially integrated research and public policy evaluation. This appendix aligns the ROS dataset’s various spatial designations with a standardized geographic database, ensuring consistent correlation between each ROS observation and the corresponding geospatial entities, namely communes and fokontany. Similar to other sections of this document, the source code developed to perform all the data processing is accessible by expanding the code blocks, as shown below.

Code
library(tidyverse)    # A series of packages for data manipulation
library(haven)        # Required for reading STATA files (.dta)
library(labelled)     # To work with labelled data from STATA
library(sf)           # for spatial data handling
library(stringdist)   # for string distance and matching
library(tmap)         # for mapping
library(fuzzyjoin)    # for fuzzy joining
library(readxl)       # Read data frames to Excel format
library(writexl)      # Write data frames to Excel format
library(gt)           # for nicely formatted tables
library(cowplot)      # to combine plots
library(gtsummary)    # to produce nice summary tables
library(janitor)      # to simply add rowsums

# Obtain all years from the directory structure
ros_data_loc <- "data/dta_format/"
years <- list.dirs(ros_data_loc, recursive = FALSE, full.names = FALSE)

3.1 Shared geographic reference system

The main purpose of this section is to enhance the georeferencing of the ROS survey data for open data sharing. The initial ROS survey, initiated in 1995, recorded geographical information in varying formats: from “village” to a combination of “municipality”, “village”, and “site”. A significant challenge was that data collection started in 1995, while in Madagascar municipalities were only formally established in 1994, and it took several years required for stabilization. The toponyms, which mostly came from oral traditions, were subjected to change, resulting in varied written representations. Our goal is to identify, disambiguate, and georeference the observations recorded in the ROS data, adopting the Common Operational Datasets (CODs) as a reference, which have been collaboratively defined by OCHA and Madagascar’s BNGRC (National Disaster Management Office).

3.1.1 The Common Operational Dataset

CODs are the foundation for all preparedness and response operations, particularly within the humanitarian sector. They were adopted by the IASC in 2008 and revised in 2010, and they play a crucial role in facilitating informed decision-making during the critical initial hours of a crisis. By ensuring consistency among stakeholders, CODs simplify data management and establish a shared operational picture of a crisis. Of particular relevance to our purpose is the inclusion of P-codes in CODs. These unique geographic identification codes can be found in both Administrative Boundary CODs (COD-ABs) and Population Statistics CODs (COD-PSs), and they help overcome challenges posed by variations in place names and spellings. For example, in Madagascar, there are 81 different administrative level 4 (ADM4) features labeled “Morafeno,” with six of them existing within ADM3 features also called “Morafeno.” The only way to distinguish them is through their unique ADM2 features.

P-codes serve as reliable geographic identifiers, eliminating errors that may arise from identical or differently spelled geographic locations. Using the HDX platform, an open platform for cross-crisis data sharing, we retrieve this data to ensure the accurate georeferencing of our ROS data. By leveraging the standardized and official spelling of places provided by P-codes, we can combine, harmonize, and analyze data from various sources, thereby offering a comprehensive and geographically accurate view of the survey’s findings.

Although some level of harmonization is achieved, especially regarding certain variables and household identifiers, the data varies in terms of geographical granularity. In the early years, the data primarily includes a single field indicating the village name. As the years go by, this evolves to include a municipality name, and in later years, an additional “site” name occasionally appears. A comprehensive overview of the observations can be obtained from the “res_deb.dta” files for each year.

3.1.2 Administrative boundaries

The “Madagascar Subnational Administrative Boundaries” dataset is sourced from the Common Operational Datasets (CODs), which provide authoritative reference datasets for decision-making during humanitarian operations. CODs are specifically designed to streamline the discovery and exchange of essential data, ensuring uniformity and using the ‘best available’ datasets. This particular dataset focuses on administrative boundaries, including gazetteers with P-codes, which facilitate organized humanitarian assessments and data management. P-codes act as unique identifiers for every administrative unit and populated area, ensuring standardization in nomenclature. When datasets adhere to the P-code standard, their integration and analysis become more efficient. The dataset provides comprehensive boundary information for Madagascar at five administrative levels: country, region, district, commune, and fokontany. It is accessible for download as shapefiles from the provided link.

3.1.3 Localities

The “Madagascar Populated Places” dataset is also part of the Common Operational Datasets (CODs). This dataset includes populated place points for Madagascar. The data has been sourced from the National Geospatial-Intelligence Agency and provided by the University of Georgia - ITOS. Furthermore, the Geographic Information Support Team (GIST) has taken on the role of distributor, with the data being published on 2007-03-07. UN OCHA ROSA has enhanced the dataset by adding P-codes and administrative boundary names based on the BNGRC (National Disaster Management Office) data. The dataset geolocates 28,184 populated places with their toponyms (names), codes related to various administrative levels such as fokontany, commune, district, and region, as well as their spatial coordinates.

Code
# Load datasets
ROS_surveys_2007 <- read_dta(paste0(ros_data_loc, "2007/res_deb.dta"))
observatory_names <- read_xlsx("references/observatory_names.xlsx")
obs_communes <- st_read(
  "data/Spatial_data/Observatoires_ROS_communes_COD_v3.gpkg",
  quiet = TRUE) %>%
  left_join(select(observatory_names, name, code), 
            by = c("OBS_NAME" = "name")) %>%
  rename(OBS_CODE = code)
pop_places <- st_read(
  "data/Spatial_data/OCHA_BNGRC populated places/mdg_pplp_places_NGA_OCHA.shp", quiet = TRUE)

3.2 Georeferencing methodology

3.2.1 Simplify strings

The treatment of toponyms presents a unique challenge, especially when these names are captured from varied sources. In the dataset, these names can vary due to differences in languages, case sensitivity, and the inclusion of additional descriptive terms. To address this, the clean_string function was developed. This function begins by converting all strings to lowercase, ensuring that subsequent comparisons are not sensitive to case variations. Next, to create a uniform standard, all non-alphanumeric characters are removed, retaining only spaces and the alphanumeric content. Common qualifiers in toponyms, such as “centre”, “haut” (high) or “bas” (low), which are not consistently used across records, are also removed. Given the bilingual nature of the dataset, with entries potentially in both Malagasy and French, the function translates cardinal points to the Malagasy language to ensure uniformity. Lastly, certain locales with multiple names, such as “Fort Dauphin”, also known as “Taolagnaro”, “Tolagnaro” or “Tolanaro”, are standardized to a single term, “Tolanaro”, to eliminate potential disparities. We also address instances of Roman numerals from I to VI, converting them to their Arabic numeral counterparts, ensuring consistent representation across records.

Code
clean_string <- function(x){
  x %>%
    tolower() %>% # Convert to lowercase
    # Retain spaces, remove other non-alphanumeric characters
    str_replace_all("[^[:alnum:][:space:]]", " ") %>% 
    str_remove_all("\\b(centre|haut|bas|androy)\\b") %>%
    str_trim() %>% # Trim spaces from start and end of string
    str_replace_all("\\bcentre\\b", "") %>% # Remove the word 'centre'
    # Translate cardinal points
    str_replace_all("\\bnord\\b", "avaratra") %>% 
    str_replace_all("\\best\\b", "atsinanana") %>%
    str_replace_all("\\bouest\\b", "andrefana") %>% 
    str_replace_all("\\bsud\\b", "atsimo") %>% 
    str_replace_all("\\batsinana\\b", "atsinanana") %>% # Replace short form 
    str_replace_all("(fort dauphin)|(taolagnaro)|(tolagnaro)", 
                    "tolanaro") %>% # Variations for fort dauphin
    # Convert Roman numerals to Arabic
    str_replace_all("\\bi\\b", "1") %>% 
    str_replace_all("\\bii\\b", "2") %>%
    str_replace_all("\\biii\\b", "3") %>% 
    str_replace_all("\\biv\\b", "4") %>% 
    str_replace_all("\\bv\\b", "5") %>% 
    str_replace_all("\\bvi\\b", "6")
}

3.2.2 Fuzzy matching

By default, statistical softwares and computing languages match text by pairing only identical strings. Exact string matching is inappropriate in our context, where location data entry was subject to human errors like typographical mistakes or minor variations in spelling. To avoid this rigidity, fuzzy matching is employed. This approach gauges the degree of similarity between two strings, bypassing the need for an exact character-to-character match. The principle metric adopted for this is the “Levenshtein distance,” which quantifies the minimum number of single-character edits required to change one string into another. The fuzzy_match function encapsulates this approach. The function initiates the process by filtering the reference list of encontered toponyms based on a given observatory code, which considerably narrows down potential matches. Then, using the Jaro-Winkler distance metric — a variant of the Levenshtein distance particularly suited for shorter strings — the function computes the similarity between the target string and entries in the filtered reference. To ensure that only relevant matches are acknowledged, a threshold, termed max_distance, is set. Matches that exceed this threshold are disregarded. For those that pass this validation, the function then extracts the pertinent details of the matched row from the reference dataframe.

Code
fuzzy_match <- function(target_string, dataframe, column_name, observatory_code, 
                        max_distance = 0.25) {
  # Filter the dataframe based on observatory_code
  filtered_reference <- dataframe %>%
    filter(OBS_CODE == observatory_code) %>%
    select(all_of(column_name), ADM3_PCODE, ADM3_EN)
  
  # If filtered_reference is empty, return NA values
  if (nrow(filtered_reference) == 0) {
    return(list(matched_string = NA, ADM3_PCODE = NA, ADM3_EN = NA, distance = 1))
  }
  
  # Use stringdist to find the closest match
  distances <- stringdist::stringdistmatrix(target_string, filtered_reference[[column_name]], method = "jw")
  
  # If there are no valid distances, set min_distance to Inf
  if(all(is.na(distances))) {
    min_distance <- Inf
  } else {
    min_distance <- min(distances, na.rm = TRUE)  # Ensure NA values don't affect the min calculation
  }
  
  # Check for Inf distance and replace it with 1
  if (is.infinite(min_distance)) {
    min_distance <- 1
  }
  
  # If min_distance exceeds the max_distance threshold, return NA values
  if (min_distance > max_distance) {
    return(list(matched_string = NA, ADM3_PCODE = NA, ADM3_EN = NA, distance = NA))
  }
  
  matched_row <- filtered_reference[which.min(distances), ]
  
  return(list(matched_string = matched_row[[column_name]], 
              ADM3_PCODE = matched_row$ADM3_PCODE, 
              ADM3_EN = matched_row$ADM3_EN, 
              distance = min_distance))
}

3.2.3 Hierarchical matching of data

The hierarchical organization of spatial entities is key for our challenge. Such an organization allows for a cascading representation of data, starting from broader scopes and narrowing down to more specific layers. This representation is reminiscent of the defining order of geographical entities: regions contain provinces, which contain municipalities, and these in turn contain localities. In the georeferencing context, leveraging this hierarchical structure can lead to more precise matches. For instance, if an observatory code is associated with a specific district, the search for matches is confined to that district, enhancing both the efficiency and accuracy of the process. Figure 3.1 represents this hierarchical arrangement, serving as a roadmap for the subsequent data matching tasks.

graph TD

A[Legend]

W[ ]
X[ ]
Y[ ]
Z[ ]

W -->|Pre-defined hierarchy| X
Y -.-|Implemented matching|Z
graph TD

%% Administrative Logic
subgraph "Administrative Logic"
A[Region]
B[District]
C[Commune]
D[Fokontany]
I[Populated places]

A --> B
B --> C
C --> D
D --> I
end

%% Observatory Logic

subgraph "Observatory Logic"
E[Observatory system]
F[Observatory]
G[Commune]
H[Village]
J[Site]

E --> F
F --> G
G --> H
F --> H
H--> J
end

%% Geospatial matching
H -.- I
C -.- G
Figure 3.1: Spatial entities pre-defined relationship and matching

Madagascar’s current administrative setup is straightforward: regions contain districts; districts have communes; and communes are made up of Fokontany. Though a Fokontany should in principle be a single village, it often includes multiple villages or populated areas. It’s worth noting that while the idea of communes has been around for a while, they were only officially recognized in 1994. However, rolling them out took some time after 1994. Before 1994, “communes” simply described local government areas without any formal administrative status. The regions were created in 2004. On the ROS side, observatory referred during the first surveys to villages. A systematic registry of the communes only started in the 2004 and 2005 rounds, depending on the observatories. A mention to “sites” also appeared in 2011 but was scarcely documented. Our strategy was to established links between the village information and populated places, and between communes.

3.3 A Detailed walk-through

We now break down our approach to describe each subsequent step. We began by exploring the ROS documentation, and in particular the reports associated to community survey, that contain descriptions of the area surveyed by each observatory. While doing so, we updated the COD subnational administrative boundary dataset by adding a field named OBS_Y_N. This field was marked as ‘1’ if the municipality was listed in an observatory survey; if not, it was left empty. Additionally, we added another field, OBS_NUM, which would store the observatory number. If the municipality wasn’t part of any survey, this field was left empty. The list of observatories and surveyed municipalities can be found in the ?tbl-list-muni.

Next, we moved on to geolocation, which took place in four stages. Each stage depended heavily on the data quality and completeness obtained from the previous ones.

3.3.1 Method 1

This phase involved a systematic alignment process for the municipalities. When municipality names were included in our dataset, we attempted to correlate them with names of municipalities identified as data collection locations. To ensure accuracy, this alignment was carried out separately for each observatory. Due to possible differences in terminology across sources, a fuzzy matching algorithm was used. This method calculates the likelihood of two different names referring to the same entity. Once a probable match was identified, we visually verified all matches and flagged any false positives for removal.

The following code segment details the methodology applied:

Code
# List of years
years <- 1995:2014

# Read all datasets and combine
all_surveys_description <- map_df(years, function(year) {
  df <- read_dta(paste0(ros_data_loc, year, "/res_deb.dta"))
    # Convert all columns to character to ensure consistency
  df <- df %>% mutate_all(as.character)
  return(df)
})

# Extract unique combinations and list all the years they appeared in
unique_combinations <- all_surveys_description %>%
  group_by(j0, j42, j4) %>%
  summarize(years = toString(unique(year)),
            obs_count = n()) %>%
  ungroup()

# Harmonize the fields that contain municipality or village names
unique_combinations <- unique_combinations %>%
  mutate(clean_muni = clean_string(j42),
         clean_village = clean_string(j4))
obs_communes <- obs_communes %>%
  mutate(clean_ADM3 = clean_string(ADM3_EN))
pop_places <- pop_places %>%
  mutate(clean_pname = clean_string(PLACE_NAME)) 

# List of observatories for which municipalities have been identified
identified_observatories <- unique(obs_communes$OBS_CODE) %>%
  na.omit()
# Filter for the observatory for which we already have a manual identification
# of municipalities
unique_combinations <- unique_combinations %>%
  filter(j0 %in% identified_observatories) 

# Apply the fuzzy matching observatory-wise
results <- map2_df(unique_combinations$clean_muni, 
                   unique_combinations$j0, 
                   ~ as.data.frame(t(
                     fuzzy_match(.x, obs_communes, "clean_ADM3", .y))))  %>%
  unnest(cols = c(matched_string, distance, ADM3_PCODE, ADM3_EN))

# Combine the results with the unique_combinations
correspondence_table <- bind_cols(unique_combinations, results) 

# Add a column for the matching method
correspondence_table <- correspondence_table %>%
  mutate(method = ifelse(!is.na(matched_string), "method_1", NA_character_))

To synthesize and visually represent our matches, the identified municipalities are plotted on a map:

Code
# Extract matched resuts
matched_results <- unique(correspondence_table$ADM3_PCODE) %>%
  na.omit()
# Keep municipalities in those
matched_spatial <- obs_communes %>%
  filter(ADM3_PCODE %in% matched_results)

tmap_mode("view")
# Plot
tm_shape(matched_spatial) +
  tm_polygons(col = "OBS_NAME")

3.3.2 Method 2: Extracting municipality names from text in “village names”

For observations where the field “municipality name” was empty (all cases before 2004 and still a frequent situation afterwards), we turned our attention to the village name field. The objective was to determine whether these village names could potentially contain a municipality’s name that had been previously identified during the observatory surveys. For the observations with no municipality value, we extracted the first word or segment in the village name. This extracted word then underwent a process of fuzzy matching against the list of identified municipalities’ names. However, manual verification identified errors. Certain matches, which we labeled as ‘false positives’, were found to be errors and were removed from the results. Once this cleansing step was completed, we included the validated matches back into the primary correspondence table.

Code
# Filter out matched municipalities and extract the first word
unmatched_results <- correspondence_table %>%
  filter(is.na(matched_string)) %>%
  select(j0:clean_village) %>%
  mutate(first_word = str_extract(clean_village, "^[^\\s/]+"))

# Fuzzy matching with the first word and identified municipalities
results_step2 <- map2_df(unmatched_results$first_word, 
                         unmatched_results$j0, 
                         ~ as.data.frame(t(
                           fuzzy_match(.x, obs_communes, "clean_ADM3", .y))))  %>%
  unnest(cols = c(matched_string, distance, ADM3_PCODE, ADM3_EN))

# Update Results
potential_matches2 <- bind_cols(unmatched_results, results_step2)

# Manually identify false positives and remove them
false_positives <- c("madiromionga", "maroarla", "tsaratanteraka", 
                    "ambatoharanana", "ambatoaranana", "maroala", "erakoja", 
                    "erakoka", "maroalo", "erakka", "erakoa")
validated_matches2 <- potential_matches2 %>%
  mutate(across(c(matched_string, ADM3_PCODE, ADM3_EN, distance),
               ~ ifelse(first_word %in% false_positives, NA, .)),
         method = ifelse(!is.na(matched_string), "method_2", NA_character_))

# Integrate new results in correspondence table
correspondence_table <- correspondence_table %>%
  filter(!is.na(matched_string)) %>%
  bind_rows(validated_matches2) %>%
  select(-first_word)

3.3.3 Method 3: Village name fuzzy matching with populated places

With some observations still devoid of a matched municipality, we initiated another processing layer. This phase saw the unmatched village names being subjected to fuzzy matching against the populated places dataset described above. We first sifted out the unmatched results from our prior analysis. Then, we augmented the ‘pop_places’ dataset with observatory codes for the entries that were located in communes recognized as surveyed by observatories. Next we performed another fuzzy matching, comparing the village names against the ‘pop_places’ names. To maintain a level of precision, we set a restrictive threshold, dismissing any match that exceeded a certain “distance” or degree of difference. But, as with the previous methods, visual verification revealed some mismatches. These ‘false positives’ were flagged and discarded and the remaining ones were integrated into the overarching correspondence table.

Code
# Re-filter unmatched results
unmatched_results2 <- correspondence_table %>%
  filter(is.na(matched_string)) %>%
  select(j0:clean_village)

# Join OBS_CODE to pop_places
pop_places <- pop_places %>%
  rename(ADM3_PCODE = COM_PCODE, ADM3_EN = COMMUNE) %>%
  mutate(ADM3_PCODE = str_replace(ADM3_PCODE, "^MDG", "MG")) %>%
  left_join(select(obs_communes, ADM3_PCODE, OBS_CODE) %>%
                     st_drop_geometry(), 
            by = "ADM3_PCODE")

# Apply the fuzzy matching observatory-wise
results_step3 <- map2_df(unmatched_results2$clean_village, 
                         unmatched_results2$j0, 
                         ~ as.data.frame(t(
                           fuzzy_match(.x, pop_places, "clean_pname", .y,
                                       max_distance = 0.22))))  %>% # erratic results beyond
  unnest(cols = c(matched_string, distance, ADM3_PCODE, ADM3_EN))

# Bind the results with unmatched_results_v2
potential_matches3 <- bind_cols(unmatched_results2, results_step3)

# Manually identify false positives and remove them
false_positives2 <- c("analambarika", "maroaloka", "ambakela", "ambodirofia",
                      "ambohibao", "ambato mangabe", "marofonaritra", 
                      "andranovo ambodimanga", "morataitra", "antanambao", 
                      "ankililoa")
validated_matches3 <- potential_matches3 %>%
  mutate(across(c(matched_string, ADM3_PCODE, ADM3_EN, distance),
               ~ ifelse(clean_village %in% false_positives2, NA, .)),
         method = ifelse(!is.na(matched_string), "method_3", NA_character_))

correspondence_table <- correspondence_table %>%
  filter(!is.na(matched_string)) %>%
  bind_rows(validated_matches3)

3.3.4 Method 4

For the remaining, we try matching with other village names for which the municipality has been matched.

Even after the above measures, there remained observations that had eluded a municipality match. The subsequent strategy was to juxtapose them with other village names that had already been successfully matched to a municipality. We assembled a table comprising municipality names and their paired village names, using data from previous successful matches. Then the still-unmatched village names were run through a fuzzy matching process against the known village names. We fine-tuned the algorithm with a restrictive matching threshold. Again, some matches stood out as anomalies. Labeled as ‘false positives’, these were sifted out and the newly matched data was incorporated into the main correspondence table.

Code
# Create a list of municipality names and village names for matched observations
matched_villages <- correspondence_table %>%
  filter(!is.na(method)) %>%
  select(j0, ADM3_PCODE, ADM3_EN, clean_village) %>%
  distinct() %>%
  rename(OBS_CODE = j0)

# Re-filter unmatched results
unmatched_results3 <- correspondence_table %>%
  filter(is.na(method)) %>%
  select(j0:clean_village)

# Try matching unmatched villages against matched village names observatory-wise
results_village_match <- map2_df(
  unmatched_results3$clean_village, unmatched_results3$j0,
  ~ as.data.frame(t(fuzzy_match(.x, matched_villages, "clean_village",
                                .y, max_distance = 0.28)))) %>%
  unnest(cols = c(matched_string, distance, ADM3_PCODE, ADM3_EN))
# Bind these results with unmatched_results_v3
potential_matches4 <- bind_cols(unmatched_results3, results_village_match)

# Manually identify false positives and remove them
false_positives4 <- c("amp0mbibitika antanakova", "arakoke ambonano ampihamibe", 
                      "farara farara", "farara ambakela", "fara ambakela", 
                      "ambatotelo marofinaritra")
validated_matches4 <- potential_matches4 %>%
  mutate(across(c(matched_string, ADM3_PCODE, ADM3_EN, distance),
               ~ ifelse(clean_village %in% false_positives4, NA, .)),
         method = ifelse(!is.na(matched_string), "method_4", NA_character_))


# Merge the updated results back to the correspondence table
correspondence_table <- correspondence_table %>%
  filter(!is.na(matched_string)) %>%
  bind_rows(validated_matches4)

3.4 Matching villages

After successfully matching commune names with an official, normalized reference, our next goal is to align the villages as identified by the ROS with their administrative counterpart, known as Fokontany. This task presents challenges, notably because the ROS village identifications weren’t directly linked to Fokontany names, and the Fokontany names themselves have undergone changes. Additionally, variations in village naming within the ROS dataset complicate the process. Our approach involves several steps: First, we load the official COD Fokontany dataset. Next, we standardize the place names using our previously discussed clean_string function. Then, using the purpose-built fuzzy_match_village function, we assess the similarity between the village names in the ROS dataset and the Fokontany names within the same commune, specifically by calculating the Jaro-Winkler distance between the strings(see Section 4.2.1 for more details on the Jaro-Winkler distance). We select the names with the highest degree of matching. During a manual review of these results, we closely examined the matches, especially those with larger Jaro-Winkler distance. From this examination, we derived a distance of 0.2, below which the majority of matches were accurate. Notably, some matches above this threshold were deemed accurate and retained, while a small number of matches below this threshold were identified as false positives and subsequently discarded.

Code
# Load the ADM4 dataset, which corresponds to Fokontany
fokontany <- st_read("data/Spatial_data/OCHA_BNGRC admin boundaries/mdg_admbnda_adm4_BNGRC_OCHA_20181031.shp",
                      quiet = TRUE) %>%
  mutate(ADM4_clean = clean_string(ADM4_EN))

# A function to perform fuzzy matching with Fokontany or populated places
fuzzy_match_village <- function(target_string, dataframe, column_name, 
                                ADM3_PCODE_filter, id_column, 
                                municipality_string = NULL, max_distance = 0.4) {
  
  # Check if target_string has multiple words and if the first word matches the municipality_string
  if (!is.null(municipality_string) && length(unlist(strsplit(target_string, " "))) > 1) {
    first_word <- unlist(strsplit(target_string, " "))[1]
    distances_municipality <- stringdist::stringdist(first_word, municipality_string, method = "jw")
    if (min(distances_municipality, na.rm = TRUE) <= max_distance) {
      target_string <- str_remove(target_string, paste0("^", first_word, " "))
    }
  }
  
  filtered_reference <- dataframe %>%
                          filter(ADM3_PCODE == ADM3_PCODE_filter) %>%
                          select(all_of(column_name), all_of(id_column))
  
  # If filtered_reference is empty, return NA values
  if (nrow(filtered_reference) == 0) {
    return(list(matched_string = NA, ID = NA, geometry = NA, distance = 1))
  }
  
  # Use stringdist to find the closest match
  distances <- stringdist::stringdistmatrix(
    target_string, filtered_reference[[column_name]], method = "jw")
  
  # If there are no valid distances, set min_distance to Inf
  if(all(is.na(distances))) {
    min_distance <- Inf
  } else {
    min_distance <- min(distances, na.rm = TRUE)
  }
  
  # Check for Inf distance and replace it with 1
  if (is.infinite(min_distance)) {
    min_distance <- 1
  }
  
  # If min_distance exceeds the max_distance threshold, return NA values
  if (min_distance > max_distance) {
    return(list(matched_string = NA, ID = NA, geometry = NA, 
                distance = NA))
  }
  
  matched_row <- filtered_reference[which.min(distances), ]
  
  return(list(matched_string = matched_row[[column_name]], 
              ID = matched_row[[id_column]], 
              distance = min_distance))
}
# correspondence table without municipality geometry
correspondence_table2 <- correspondence_table %>%
  st_drop_geometry()

# For fokontany
results_fokontany <- pmap_df(
  list(clean_village = correspondence_table2$clean_village, 
       ADM3_PCODE = correspondence_table2$ADM3_PCODE, 
       municipality_string = correspondence_table2$clean_muni),
  function(clean_village, ADM3_PCODE, municipality_string) {
    as.data.frame(t(fuzzy_match_village(clean_village, fokontany, 
                                        "ADM4_clean", ADM3_PCODE, "ADM4_PCODE", municipality_string)))
  }
) %>%
   unnest(cols = everything()) %>%
  rename(ADM4_PCODE = ID, ADM4_clean = matched_string, distance2 = distance) %>%
  st_drop_geometry() %>%
  select(-geometry)
 
# Combine the results with correspondence_table
correspondence_table2_updated <- bind_cols(correspondence_table2, 
                                          results_fokontany) %>%
  mutate(method_village = ifelse(!is.na(ADM4_PCODE), "method_5", NA_character_))


# For a visual validation
correspondence_table2_updated %>% 
  arrange(desc(distance2)) %>%
  write_xlsx("correspondence_table2_update.xlsx")

# After instpecting the data, we find that a 0.2 threshold in distance is 
# appropriate. However some valid matches above this threshold should be kept

high_tolerance_valid_matches <- c(
  "ambanja", "ambararata 2", "ambaro", "ambatoharanana", "ambatomanga", 
  "ambazoamazava", "ambazoamirafy", "ambodibonara", "ambodimotso atsimo", 
  "ambohidrony", "ambohimahatsinjo atsinanana", "ambongabe", "amparafaravola", 
  "ampasy", "ampijoroa", "ampijoroan ala", "ampitana", "analambarika", 
  "andohasoamahainty", "anjiamangirana 1", "ankazoabo anivo", "ankerereake", 
  "ankililoaka 2", "bepako", "bevato", "erada 2", "esanta fototra", 
  "esanta marofoty", "esanta maromainty", "feramanga atsimo", "habohabo atsimo", 
  "habohabo avaratra", "lafiatsinanana", "madiromiongana", "manambaro 2", 
  "manantenina", "mangabe", "mangaoka", "maritampona", "maroala", "marofoty", 
  "marolampy", "marovantaza", "miary", "miary ankasy", "miary ankoronga", 
  "miorimivalana", "tanambao", "tanambao 2", "tandroka andrefana", 
  "tsarahasina", "vohilengo")

# Below this threshold, a few false positives should be removed
low_tolerance_invalid_marches <- c(
  "AMBOHIMAHATSINJO EST", "AMBOHIMAHATSINJO-EST/CENTRE", 
  "AMBALAKINDRESY/ANDOHARENINA", "ambazoa /ampiha", "AMBALAKINDRESY/ANDOHARENA", 
  "ANKAZOABO SUD/ALAMBARIKE", "ANKARINEZAKA/NAMARINA", "AMBALA/ANDRANOLAVA", 
  "AMBALAKINDRESY/ANDRANOLAVA", "AMBALAKINDRESY/ANDRANOLAVA")

correspondence_table3 <- correspondence_table2_updated %>%
  mutate(
    across(c(ADM4_clean, ADM4_PCODE, distance2, method_village), 
           ~ case_when(
             (distance2 > 0.2 & !ADM4_clean %in% high_tolerance_valid_matches) |
             ADM4_clean %in% low_tolerance_invalid_marches ~ NA,
             TRUE ~ .)))

3.5 Validating the Quality of Georeferenced Data

To ensure the robustness and validity of our georeferenced data, we employ a multi-tiered validation approach that hinges on quantitative metrics and qualitative consistency checks.

3.5.1 Quantitative metrics

We first look at the distribution of the unique observations that were successfully matched during the various stages of the georeferencing process.

Code
# Re-format modality labels  
matching_tbl <- correspondence_table3 %>%
  mutate(method = case_when(
    method == "method_1" ~ "Method 1",
    method == "method_2" ~ "Method 2",
    method == "method_3" ~ "Method 3",
    method == "method_4" ~ "Method 4",
    is.na(method) ~ "Unmatched communes"),
    method_village = case_when(
      method_village == "method_5" ~ "Matched villages",
      is.na(method_village) ~ "Unmatched villages"))

summarized_data <- matching_tbl %>%
  group_by(method, method_village) %>%
  summarize(total_obs = sum(obs_count, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(percent_obs = total_obs/sum(total_obs))

count_tbl <- summarized_data %>%
  select(method, method_village, total_obs) %>%
  pivot_wider(names_from = method_village, values_from = total_obs, 
              values_fill = 0) %>%
  rowwise() %>%
  mutate(Total = sum(c(`Matched villages`, `Unmatched villages`))) %>%
  ungroup()  %>%
  adorn_totals("row")

percent_tbl <- summarized_data %>%
  select(method, method_village, percent_obs) %>%
  pivot_wider(names_from = method_village, values_from = percent_obs, 
              values_fill = 0) %>%
  rowwise() %>%
  mutate(`Total (%)` = sum(c(`Matched villages`, `Unmatched villages`))) %>%
  ungroup() %>%
  rename(`Matched villages (%)` = `Matched villages`,
         `Unmatched villages (%)` = `Unmatched villages`) %>%
  mutate(across(where(is.numeric), ~ round(.x * 100, 2)),
         across(where(is.numeric))) %>%
  adorn_totals("row")


summarized_data2 <- count_tbl %>%
  left_join(percent_tbl, by = "method") %>%
  relocate(method, `Matched villages`, `Matched villages (%)`, 
           `Unmatched villages`, `Unmatched villages (%)`,
           Total, `Total (%)`) %>%
  rename(`Commune matching method` = method)

gt(summarized_data2)  %>%
  # Add a title
  tab_header(title = "Result of matching process",
             subtitle = "Number of survey observation per method") 
Table 3.1:

Number of ROS observation which commune and village were matched

Result of matching process
Number of survey observation per method
Commune matching method Matched villages Matched villages (%) Unmatched villages Unmatched villages (%) Total Total (%)
Method 1 30925 30.69 12605 12.51 43530 43.20
Method 2 16491 16.37 6727 6.68 23218 23.04
Method 3 10723 10.64 4006 3.98 14729 14.62
Method 4 3843 3.81 7198 7.14 11041 10.96
Unmatched communes 0 0.00 8237 8.18 8237 8.18
Total 61982 61.51 38773 38.49 100755 100.00

3.5.2 Consistency check with documentation

To ensure the comprehensiveness of our georeferencing process, we summarize the list of unique matches for villages from all the municipalities in the final dataset. This check will help ensure there are no glaring gaps in the matched data.

Code
fokontany_list <- correspondence_table3 %>%
  mutate(j0 = as.numeric(j0)) %>%
  left_join(observatory_names, by = c("j0" = "code")) %>%
  rename(OBS_NAME = name) %>%
  left_join(select(st_drop_geometry(fokontany), ADM4_PCODE, ADM4_EN),
            by = "ADM4_PCODE") %>%
  group_by(OBS_NAME, ADM3_EN) %>%
  summarize(ADM4_EN = list(unique(if_else(is.na(ADM4_EN), "Unknown", ADM4_EN)))) %>%
  ungroup() %>%
  mutate(ADM4_EN = map_chr(ADM4_EN, ~ str_c(.x, collapse = ", ")))

fokontany_list %>%
  rename(`Observatory Name` = OBS_NAME,
         `Commune Name` = ADM3_EN,
         `Fokontany Names` = ADM4_EN) %>%
  DT::datatable(caption = "List of fokontany per commune and observatory")

?(caption)

We compare this with the ROS documentation to assess which villages might have been omitted by our algorithm.

3.5.3 Visual exploration and avenues for continuous corrections

We visually explore the the georeferenced data to observe its spatial distribution and spot eventual anomalies or clustering patterns that might not be immediately obvious from numerical summaries. The visualization below maps out our matched municipalities alongside villages. Discrepancies or clusters may indicate a need for further refinement or correction.

Code
commmune_pcode_years <- correspondence_table3 %>%
  # Split and unnest the years
  mutate(years = str_split(years, ", ")) %>%
  unnest(years) %>%
  # Convert years to numeric for proper sorting
  mutate(years = as.numeric(years)) %>%
  # Group by ADM3_PCODE and extract unique, sorted years
  group_by(ADM3_PCODE) %>%
  summarize(years = list(unique(years))) %>%
  ungroup() %>%
  # Sort and collapse the years
  mutate(years = map_chr(years, ~ paste(sort(.x), collapse = ", ")))

fokontany_pcode_years <- correspondence_table3 %>%
  # Split and unnest the years
  mutate(years = str_split(years, ", ")) %>%
  unnest(years) %>%
  # Convert years to numeric for proper sorting
  mutate(years = as.numeric(years)) %>%
  # Group by ADM3_PCODE and extract unique, sorted years
  group_by(ADM4_PCODE) %>%
  summarize(years = list(unique(years))) %>%
  ungroup() %>%
  # Sort and collapse the years
  mutate(years = map_chr(years, ~ paste(sort(.x), collapse = ", ")))

selected_communes <- commmune_pcode_years %>%
  filter(!is.na(ADM3_PCODE)) %>%
  left_join(obs_communes, by = "ADM3_PCODE") %>%
  st_sf()

selected_fokontany <- fokontany_pcode_years %>%
  filter(!is.na(ADM4_PCODE)) %>%
  left_join(fokontany, by = "ADM4_PCODE") %>%
  st_sf()  %>%
  mutate(label = "Surveyed______")

write_rds(selected_communes, "output/selected_communes.rds")
write_rds(selected_fokontany, "output/selected_fokontany.rds")

tmap_mode("view")
tm_shape(selected_communes) + 
  tm_fill(col = "OBS_NAME", palette = "Set1", title = "Observatory",
          id = "ADM3_EN", legend.show = FALSE,
          popup.vars = c("Observatory" = "OBS_NAME",
                         "Data collection years"  = "years",
                         "District" = "ADM2_EN",
                         "Region" = "ADM1_EN")) +
  tm_shape(selected_fokontany) + 
  tm_fill(col = "label", palette = c("black"), alpha = 0.6, title = "Fokontany", 
          id = "ADM4_EN", legend.show = FALSE,
          popup.vars = c("Observatory" = "OBS_NAME",
                         "Data collection years"  = "years",
                         "Commune" = "ADM3_EN",
                         "District" = "ADM2_EN",
                         "Region" = "ADM1_EN")) +
  tm_scale_bar()
Figure 3.2: Map of the surveyed communes and fokonany

From this summary table, we see that the communes could not be re-identified for 8.18% of the observations and the village could not be re-identified for 38.49% of observations. A manual work to identify villages will have to be implemented to be able to improve this matching.

3.6 Conclusion

Given the variability in location labels recorded in the ROS data, we emphasized standardizing and aligning toponyms. We introduced the clean_string function to unify place names, addressing variances in language, letter casing, and other descriptors. Utilizing fuzzy matching, we connected similar text strings, using the Levenshtein distance as a benchmark. To heighten accuracy, a hierarchical data matching strategy was applied, reflecting the inherent structure of geographical units, where regions envelop districts, which further consist of municipalities and localities. The COD data, curated by BNGRC and OCHA, was our reference point for these matches. Various techniques, including hierarchical fuzzy matching and visual checks, were employed.

As a result, we could automatically match 98% of commune name variations in the dataset with an official commune, and 62% of village name variations with a fokontany. The outcomes are consistent with ROS’s internal documentation, and the spatial representations can be further inspected for validation. Future corrections and enhancements are anticipated, and our system, leveraging Quarto notebook and reproducible R code, will facilitate these refinements.

ROS began its data collection at a time when geolocation best practices weren’t as evolved as they are today. Should more survey rounds be conducted in the future, the intricate georeferencing process demonstrated here should serve both as a cautionary tale and a guide on optimizing data registration (for instance, using PCODES). We’re optimistic that refining the georeferencing of past ROS data, as presented in this work, will amplify the data’s utility for both research and policy decisions.