Libraries needed for this tutorial:

library(httr)
library(jsonlite)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

This tutorial uses the library httr to establish the connection with the GraphQL API, but there are other options for interacting with GraphQL from R; see the R packages ghql, gqlr and graphql.

Introduction to GraphQL API and how to query it

GraphQL is a query language for Application Programming Interfaces (APIs). Queries are written in the GraphQL language, and the result (the data) is given back in JSON format.

If you are not familiar with GraphQL, we recommend starting with the Introduction to GraphQL and querying the API in the Zendro How to Guides.

Zendro provides a GraphQL API web interface, called GraphiQL, which is a Web Browser tool for writing, validating, and testing GraphQL queries.

For example, try copy-pasting and executing the following query at https://zendro.conabio.gob.mx/api/graphql, which is the API that we will be using in this and other tutorials.

{
rivers(pagination:{limit:10, offset:0}){
      river_id
      name
      length
}
} 

(The example above only gets the first 10 results. A later section of this tutorial explains how to define pagination to pull down a given number, or all, of the items in a dataset.)

Download a small dataset (<1,000 elements):

The function get_from_graphQL() defined below queries a GraphQL API and transforms the data from JSON format (which is the output of GraphQL) into an R data frame object you can easily use for further analyses. If you want to know what’s going on inside this function, there is a detailed step-by-step description at the end of this document.

To start using get_from_graphQL() first run the code below to load the function into your R environment (you can also have it as a different file and use source() to run it):

get_from_graphQL<-function(query, url){
### This function queries a GraphQL API and outputs the data into a single data.frame 

## Arguments
# query: a GraphQL query. It should work if you try it in the GraphiQL interface. Must be a character string.
# url: URL of the server to query. Must be a character string.

## Needed libraries:
# library(httr)
# library(jsonlite)
# library(dplyr)
# library(stringr)

### Function

##  query the server
result <- POST(url, body = list(query=query), encode=c("json"))

## check server response
status_code<-result$status_code

if(status_code!=200){
  print(paste0("Oh, oh: status code ", status_code, ". Check your query and that the server is working"))
}

else{
  
  # get data from query result
  jsonResult <- content(result, as = "text", encoding = "UTF-8") 
  
  # transform the JSON text into an R list
  readableResult <- fromJSON(jsonResult, 
                           flatten = TRUE) # flatten = TRUE combines nested data frames into a single data frame (because data coming from different models is nested)
  
  # check if data downloaded without errors
  # GraphQL sends an "errors" field if there is a problem with the query and the data was not downloaded properly, even if the connection status was 200. 
  errors<-!is.null(readableResult[["errors"]])
  if(errors){
    print("Sorry :(, your data downloaded with errors, check your query and API server for details")
  } 
  else{ 
  # get data
  data<-as.data.frame(readableResult$data[1]) 
  
  # rename colnames to original variable names
  x<-str_match(colnames(data), "\\w*$")[,1] # matches word characters (i.e. not the ".") at the end of the string
  colnames(data)<-x # assign new colnames
  return(data)
    }
  }
}

get_from_graphQL() allows you to get up to 1,000 elements (results of your query) at a time, which is the maximum number the API allows for a single batch. In the next section we explain how to use pagination to download larger datasets in batches.

To use the get_from_graphQL() function, first you have to define a GraphQL query. If you don’t know how to do this, start by checking the Introduction to GraphQL and querying the API of Zendro How to Guides.

Once you have a GraphQL query working, you’ll need to save it to an R object as a character vector:

my_query<- "{
rivers(pagination:{limit:10, offset:0}){
      river_id
      name
      length
   }
}
"

Next we use this query as an argument for get_from_graphQL(), along with the URL of the API, which is the same as that of the GraphiQL web interface you explored above:

data<-get_from_graphQL(query=my_query, url="https://zendro.conabio.gob.mx/api/graphql")

If all went well you will get a data frame with the result of your query:

head(data)

Download a dataset with more than 1,000 elements:

The API returns the results of a query in batches of at most 1,000 elements. So if the data you want to download is larger than that, you need to paginate, i.e. get the data in batches. Pagination is specified as an argument within the GraphQL query.

Zendro uses limit-offset pagination with the syntax:

pagination:{limit:[integer], offset:[integer]}

See GraphQL documentation and this tutorial on GraphQL pagination for more details.
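As a small illustration of the syntax above, the pagination argument can be assembled in R before it is pasted into a query. This is only a sketch: build_pagination() and build_rivers_query() are hypothetical helper names that are not part of this tutorial's code or of any package. The sketch uses format(..., scientific = FALSE) because R may otherwise print large numbers such as 100000 as "1e+05" when pasting them into a string.

```r
# Hypothetical helper (an assumption, not part of the tutorial's code):
# builds the limit-offset pagination argument as a character string.
build_pagination <- function(limit, offset = 0) {
  stopifnot(is.numeric(limit), is.numeric(offset), limit >= 1, offset >= 0)
  # scientific = FALSE keeps large numbers in plain digits
  paste0("pagination:{limit:", format(limit, scientific = FALSE),
         ", offset:", format(offset, scientific = FALSE), "}")
}

# Hypothetical helper: wraps the pagination argument in the rivers query
# used throughout this tutorial.
build_rivers_query <- function(limit, offset = 0) {
  paste0("{ rivers(", build_pagination(limit, offset),
         "){ river_id name length } }")
}

build_pagination(10, 0)
build_rivers_query(10, 20)
```

The resulting string can be passed as the query argument of get_from_graphQL() in the same way as a query written by hand.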

In the previous examples we downloaded only 10 elements (pagination:{limit:10}) from the rivers type, but the dataset is larger. (Remember, data in GraphQL is organised in types and fields within those types. When thinking about your structured data, you can think of types as the names of tables, and fields as the columns of those tables. In the example above rivers is a type and the fields are river_id, name and length, among others.)

To find out how many elements a type has, we can make a query with the count function, if it is available for the type we are interested in. You can check this in the Docs at the top right menu of the GraphiQL interface.

For example, rivers has the function countRivers so with the query {countRivers} we can get the total number of rivers.

Similar to how we got data before, you can use this very simple query in the function get_from_graphQL to get the number of rivers into R:

# query API with count function
no_records<-get_from_graphQL(query="{countRivers}", url="https://zendro.conabio.gob.mx/api/graphql")

# change to vector, we don't need a df
no_records<-no_records[1,1]
no_records
## [1] 50

In this case we have 50. Technically we could download all the data in a single batch because it is <1000, but for demonstration purposes we will download it in batches of 10.
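The arithmetic behind batching can be sketched on its own, without contacting the server. The helper page_offsets() below is a hypothetical name introduced only for this illustration: it returns one offset per batch, starting at 0, so that the last batch picks up whatever records remain.

```r
# Hypothetical helper (an assumption, not part of the tutorial's code):
# offsets needed to fetch n_records in batches of size `limit`.
page_offsets <- function(n_records, limit) {
  stopifnot(n_records >= 0, limit >= 1)
  n_pages <- ceiling(n_records / limit)        # the last page may be partial
  seq(from = 0, by = limit, length.out = n_pages)
}

page_offsets(50, 10)      # 5 pages: 0 10 20 30 40
page_offsets(2345, 1000)  # 3 pages: 0 1000 2000
```

With 50 records and a limit of 10, five requests are enough; an offset of 50 would ask for records beyond the end of the dataset.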

The following code calculates the number of pages needed to get a given number of records assuming a desired limit (size of each batch). Then it runs get_from_graphQL() within a loop for each page until getting the total number of records desired.

# Define desired number of records and limit. Number of pages and offset will be estimated based on the number of records to download
no_records<- no_records # this was estimated above with a query to count the total number of records, but you can also manually change it to a custom desired number
my_limit<-10 # max 1000. 
no_pages<-ceiling(no_records/my_limit)

## Define offset.
# The following line calculates the offsets automatically
# based on the number of pages needed
# (one offset per page, starting at 0).
my_offset<-seq(from=0, by=my_limit, length.out=no_pages)

# Or you can define the offset manually 
# by uncommenting the following line
# and commenting out the line above:
# my_offset<-c(#manually define your vector) 

## create the object in which to store the downloaded data. Leave empty
data<-data.frame()

##
## Loop to download the data from GraphQL using pagination
## 

for(i in seq_along(my_offset)){

# Define pagination
pagination <- paste0("limit:", my_limit, ", offset:", my_offset[i])

# Define query looping through desired pagination:
my_query<- paste0("{
  rivers(pagination:{", pagination, "}){
      river_id
      name
      length
   }
   } 
   ")



# Get data and add it to the already created df
data<-rbind(data, get_from_graphQL(query=my_query, url="https://zendro.conabio.gob.mx/api/graphql"))

#end of loop
}

As a result you will get all the data in a single df:

head(data)
summary(data)
##    river_id             name               length      
##  Length:50          Length:50          Min.   :  65.0  
##  Class :character   Class :character   1st Qu.: 150.0  
##  Mode  :character   Mode  :character   Median : 283.0  
##                                        Mean   : 347.1  
##                                        3rd Qu.: 402.5  
##                                        Max.   :1521.0  
##                                        NA's   :6
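An alternative to growing the data frame with rbind() inside a loop is to collect each page in a list and bind the pages once at the end with dplyr's bind_rows(). The sketch below shows the pattern without contacting the server: fetch_page() is a made-up stand-in for get_from_graphQL() that returns invented rows, and the total of 25 records is an assumption chosen only for the example.

```r
library(dplyr)

# Stand-in for get_from_graphQL() (an assumption for illustration only):
# returns one page of made-up rows so the sketch runs without a server.
fetch_page <- function(limit, offset, total = 25) {
  if (offset >= total) return(data.frame())
  ids <- seq(offset + 1, min(offset + limit, total))
  data.frame(river_id = as.character(ids), length = ids * 10)
}

my_limit  <- 10
my_offset <- seq(from = 0, by = my_limit, length.out = ceiling(25 / my_limit))

# Fetch every page into a list, then bind all pages in a single step
pages    <- lapply(my_offset, function(off) fetch_page(my_limit, off))
all_data <- bind_rows(pages)

nrow(all_data)  # 25
```

In real use, the call to fetch_page() would be replaced by a call to get_from_graphQL() with a query built for each offset.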

get_from_graphQL() explained step by step

The following is a step-by-step example explaining in more detail how the function get_from_graphQL() used above works.

First, once you have a GraphQL query working, you’ll need to save it to an R object as a character vector:

my_query<- "{
rivers(pagination:{limit:10, offset:0}){
      river_id
      name
      length
   }
}
"

Next, define the URL of the API as another character vector; it is the same as that of the GraphiQL web interface you explored above:

url<-"https://zendro.conabio.gob.mx/api/graphql"

Now we can send a query to the API by using a POST request:

# query server
result <- POST(url, body = list(query=my_query), encode = c("json"))

The result that we are getting is the http response. Before checking if we got the data, it is good practice to verify if the connection was successful by checking the status code. A 200 means that all went well. Any other code means problems. See this.

# check server response
result$status_code
## [1] 200

We now need to extract the data in order to be able to manipulate it. If everything went well, the http response will contain an attribute data which will itself contain an attribute named as the query, in this case rivers.

result
## Response [https://zendro.conabio.gob.mx/api/graphql]
##   Date: 2022-07-27 23:15
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 983 B
## {
##   "data": {
##     "rivers": [
##       {
##         "river_id": "1",
##         "name": "Acaponeta",
##         "length": 233
##       },
##       {
##         "river_id": "10",
## ...

If the query is not written properly or if there is any other error, the attribute data won’t exist and instead we will get the attribute errors listing the errors found.
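One way to tell the two cases apart in R is to parse the response body and test whether an errors field is present. The sketch below uses two hand-written JSON strings instead of a live response; the error message text is invented for the example, and has_graphql_errors() is a hypothetical helper name, not a function from httr or jsonlite.

```r
library(jsonlite)

# Hand-written response bodies for illustration (the error text is made up)
error_json <- '{"errors":[{"message":"Example error message"}]}'
ok_json    <- '{"data":{"rivers":[{"river_id":"1","name":"Acaponeta","length":233}]}}'

# Hypothetical helper: TRUE if the parsed response carries an "errors" field
has_graphql_errors <- function(json_text) {
  parsed <- fromJSON(json_text)
  !is.null(parsed[["errors"]])
}

has_graphql_errors(error_json)  # TRUE
has_graphql_errors(ok_json)     # FALSE

# The messages themselves can be read from the parsed list
fromJSON(error_json)[["errors"]]$message
```

Using [["errors"]] rather than $errors avoids the partial name matching that $ performs on lists.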

If all went well we can proceed to extract the content of the results with:

# get data from query result
jsonResult <- content(result, as = "text") 

The result will be in JSON format, which we can convert into an R object (a list). In this list, the results are stored under each type used in the query. The argument flatten is used to collapse data from different types into a single data frame.

# transform from JSON to an R list
readableResult <- fromJSON(jsonResult, 
                         flatten = TRUE)
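To see what flatten does, it can help to parse a small hand-written JSON string with and without it. In the sketch below the nested country field is a hypothetical one-to-one association invented for the example; it is not claimed to exist in the rivers type of the API, and the second river is made up.

```r
library(jsonlite)

# Hand-written JSON with a hypothetical nested "country" object per river
nested_json <- '{"data":{"rivers":[
  {"river_id":"1","name":"Acaponeta","country":{"country_id":"MX","name":"Mexico"}},
  {"river_id":"2","name":"Example river","country":{"country_id":"MX","name":"Mexico"}}
]}}'

# Without flatten, "country" is a data frame stored inside one column
nested <- fromJSON(nested_json)$data$rivers
names(nested)   # "river_id" "name" "country"

# With flatten = TRUE, the nested columns are spread out and prefixed
flat <- fromJSON(nested_json, flatten = TRUE)$data$rivers
names(flat)     # includes "country.country_id" and "country.name"
```

Note that in a case like this, keeping only the part of each column name after the last "." would turn country.name into name, which would clash with the river's own name column, so the renaming step may need adjusting when associations are included.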

Extract data:

# get data
data<-as.data.frame(readableResult$data[1]) 
head(data)

By default, the name of each type will be added at the beginning of each column name:

colnames(data)
## [1] "rivers.river_id" "rivers.name"     "rivers.length"

To keep only the name of the variable as it is in the original data:

x<-str_match(colnames(data), "\\w*$")[,1] # matches word characters (i.e. not the ".") at the end of the string
colnames(data)<-x # assign new colnames 

So finally we have the data in a single nice looking data frame:

head(data) 

Notice that you will get a data frame like the one above only for one-to-one associations; in other cases you will still get variables that are lists, which you can process in a separate step.
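As a rough sketch of such a separate step, the example below parses a hand-written JSON string in which each river has a hypothetical one-to-many countries association (invented for this example, with made-up rows). The association arrives as a list column, which is then collapsed into one character value per river.

```r
library(jsonlite)

# Hand-written JSON with a hypothetical one-to-many "countries" association
assoc_json <- '{"data":{"rivers":[
  {"river_id":"1","name":"River A","countries":[{"name":"Mexico"},{"name":"Guatemala"}]},
  {"river_id":"2","name":"River B","countries":[{"name":"Mexico"}]}
]}}'

rivers <- fromJSON(assoc_json, flatten = TRUE)$data$rivers

# "countries" is a list column: one small data frame per river
is.list(rivers$countries)

# Collapse each river's countries into a single string
rivers$country_names <- vapply(
  rivers$countries,
  function(x) paste(x$name, collapse = "; "),
  character(1)
)
rivers[, c("river_id", "name", "country_names")]
```

Collapsing to a string is only one option; depending on the analysis, it may be more useful to expand the list column into one row per associated record.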