Is it possible to scrape a series of nodes after a certain text string?

I would like to scrape a series of html_nodes from a series of pages. The problem is that those elements are inside a list that has neither a class nor an id. I can’t rely on a fixed XPath either, because the position of the desired elements differs from one page to another depending on the content that precedes them.

Detailed information:

Sample page: https://www.fablabs.io/machines/othermill
Target: I would like to scrape the names of all fablabs that are using that specific machine.
The HTML code (fragment) looks like this:

Available at
  • The Beach Lab x Middle East
  • ...
Since there are no classes or ids to target, my only option would be an absolute XPath like this:

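library(rvest)
url <- "https://www.fablabs.io/machines/othermill"
# Absolute path to the links inside the third list; only valid for this page layout
fablabs <- read_html(url) %>%
  html_nodes(xpath = "/html/body/div[2]/div[2]/div[2]/ul[3]/li/a") %>%
  html_text()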
Unfortunately, although this works for this page, it will not work on other pages, because the position of this list changes from page to page depending on the content that precedes it.

The only thing I know is that I want to scrape whatever sits below the string "Available at". Is there any way to achieve that in R?


I’d suggest opening an issue about this. We are working on improving the APIs of Fablabs.io (and all its technical infrastructure), and an issue would help us understand your needs and how to improve the APIs to meet them. So please give us a detailed suggestion of what you’d need, how the data would be structured, and what it would be used for. It might take a bit more time than just scraping, but then you’d have the data already formatted :)

Please open an issue on the GitHub repo here: https://github.com/fablabbcn/fablabs.io

Furthermore, we are working on several data analyses of Fablabs.io to be integrated into the platform (but in Python!). Let me know what you are interested in; maybe we can try to integrate your research into the platform as well!

Yes, it is possible to scrape a series of nodes that come after a certain text string.
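
One way this might be done in R is to anchor the XPath on the text itself instead of on a position: find the element whose text is "Available at", then take the first list that follows it as a sibling. The sketch below runs against a small made-up HTML fragment, since the real markup of the page is not shown in this thread. The `h5` headings, the extra lists, the second lab name, and the assumption that the label and the `ul` are siblings are all hypothetical and may need adjusting for the live page.

```r
library(rvest)

# Hypothetical markup: the real page structure is an assumption.
# The target list is deliberately NOT the first list on the page.
sample_html <- '
<html><body>
  <h5>Used for</h5>
  <ul><li><a href="/x">Not a lab</a></li></ul>
  <h5>Available at</h5>
  <ul>
    <li><a href="/labs/a">The Beach Lab x Middle East</a></li>
    <li><a href="/labs/b">Another Lab</a></li>
  </ul>
  <h5>Discuss</h5>
  <ul><li><a href="/y">Also not a lab</a></li></ul>
</body></html>'

page <- read_html(sample_html)

# 1. Match any element whose whole text is exactly "Available at".
# 2. Step to the first <ul> that follows it at the same level.
# 3. Collect the links inside that list's items.
xpath_after_label <- paste0(
  "//*[normalize-space(.) = 'Available at']",
  "/following-sibling::ul[1]/li/a"
)

fablabs <- page %>%
  html_nodes(xpath = xpath_after_label) %>%
  html_text(trim = TRUE)

print(fablabs)
```

Because the expression depends only on the label text and not on `ul[3]` or any other index, the same XPath should keep working when earlier sections are added or removed. To try it on a real page, replace `read_html(sample_html)` with `read_html(url)`; if the label turns out to be nested differently there, the `following-sibling::ul[1]` step is the part to adapt.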