I am quite new to R and am currently trying to do web scraping on the ISO website. I am having trouble using Selenium: it does not seem to be compatible with my Chrome version, even though I tried installing a different version of Chrome. I have seen several people with a problem close to mine, but their solutions do not seem to work for me.

Here is my project: I want to store in an R data frame the results that appear when I search for “packaging” on the ISO website, in the “standards” section. For each result, I want my code to take the title of the standard and its ISO number, open the link, and collect the number of pages of the standard as well as its date of publication. My current code only tries to extract the ISO numbers and titles of the results; it does not yet click on each result to open the associated page and collect the number of pages and the date of publication.

 #### WEB SCRAPING ISO STANDARDS ####

  library(dplyr)

  ## 1. I get the HTML code of the first page of results to understand its structure.
  # I should probably do the same for the page that opens when I click on the
  # first result.

  library(rvest)
  library(xml2)
  url <- "https://www.iso.org/search.html?PROD_isoorg_en%5Bquery%5D=packaging&PROD_isoorg_en%5Bmenu%5D%5Bfacet%5D=standard"
  page <- read_html(url)
  html_content <- as.character(page)
  cat(html_content)
  html_structure <- xml_structure(page)

 ## 2. I get the HTML code from each of the 24 pages of results (there are 470 results in total)

 library(rvest)
 library(purrr)
  base_url <- "https://www.iso.org/search.html?PROD_isoorg_en%5Bquery%5D=packaging&PROD_isoorg_en%5Bmenu%5D%5Bfacet%5D=standard&page="
   extract_html <- function(page_number) {
   url <- paste0(base_url, page_number)
   page <- read_html(url)
    html_content <- as.character(page)
    return(html_content)
   }
  num_pages <- 24  # 470 results / 20 results per page = 23.5, rounded up
    html_contents <- map(1:num_pages, extract_html)


   ## The RSelenium package automates web browser interactions, such as clicking on elements, filling out forms, and scrolling.
    # RSelenium simulates user behavior, which gives access to dynamically loaded content.

  # I add my Java installation to the PATH

  java_path <- "C:/Users/cogez/Downloads/jre-8u401-windows-x64/bin"
   Sys.setenv(PATH = paste(java_path, Sys.getenv("PATH"), sep = .Platform$path.sep))
   print(Sys.getenv("PATH"))
   library(RSelenium)

   # Start a remote driver

   chrome_driver_version <- "123.0.6312.58"
   driver <- rsDriver(browser = "chrome", chromever = chrome_driver_version)


   remDr <- driver[["client"]]
   remDr$navigate("https://www.iso.org/search.html?PROD_isoorg_en%5Bquery%5D=packaging&PROD_isoorg_en%5Bmenu%5D%5Bfacet%5D=standard")


  # Function to extract ISO numbers and titles from HTML content
    extract_iso_numbers_titles <- function(html_content) {
     page <- read_html(html_content)
     iso_numbers_titles <- page %>% 
      html_nodes("div.h5.card-title a") %>% 
      html_text()

     # Extract ISO numbers and titles from combined text
    iso_numbers <- sub("ISO (\\d+:\\d+).*", "\\1", iso_numbers_titles)
     titles <- sub(".* — ", "", iso_numbers_titles)

    # Combine ISO numbers and titles into a data frame
     results <- data.frame(
       ISO_Number = iso_numbers,
       Title = titles
    )

    return(results)
  }

  # Function to extract solutions from each page
   extract_solutions <- function(html_content) {
     # Call the function to extract ISO numbers and titles
    iso_titles <- extract_iso_numbers_titles(html_content)

    # Return ISO numbers and titles
    return(iso_titles)
  }
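
For reference, the parsing step can be sketched in isolation on made-up strings. The sample titles below are hypothetical and only assume the "ISO number, separator, title" layout that the code above expects; the real text on the ISO site may differ. The pattern is a slight variant of the one in my function (it also allows an optional part number such as `-1`, and it removes only the text before the first separator), so treat it as an assumption rather than a tested solution.

```r
# Hypothetical card-title strings (NOT real search results).
# "\u2014" is the long dash that separates the number from the title.
sample_titles <- c(
  "ISO 1234:2020 \u2014 Packaging \u2014 Example title",
  "ISO 5678-1:2019 \u2014 Packaging \u2014 Another example"
)

# Backslashes are doubled because an R string literal consumes one level
# of escaping before the regex engine sees the pattern.
iso_numbers <- sub("^ISO (\\d+(-\\d+)?:\\d+).*", "\\1", sample_titles)

# Non-greedy match: drop only the text up to the first separator,
# so multi-part titles are kept whole.
titles <- sub("^.*? \u2014 ", "", sample_titles, perl = TRUE)

results <- data.frame(
  ISO_Number = iso_numbers,
  Title = titles,
  stringsAsFactors = FALSE
)
print(results)
```

If the real titles follow this layout, the same two `sub()` calls could replace the ones inside `extract_iso_numbers_titles()`.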

I would welcome any help: first to fix my current code, and second to extend it so that it clicks on each result, opens the corresponding page, and collects the two additional pieces of information I want to store in my data frame for each ISO standard containing the term “packaging” in the title (i.e. the number of pages of the standard and its date of publication).

Thanks in advance!
