Libraries needed for this tutorial:
library(httr)
library(jsonlite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
This tutorial uses the httr package to establish the connection with the GraphQL API, but there are other options for interacting with GraphQL from R. See the R packages ghql, gqlr and graphql.
GraphQL is a query language for Application Programming Interfaces (APIs). Queries are written in the GraphQL language, and the result (the data) is given back in JSON format.
If you are not familiar with GraphQL, we recommend that you start with the Introduction to GraphQL and querying the API of Zendro How to Guides.
Zendro provides a GraphQL API web interface, called GraphiQL, which is a Web Browser tool for writing, validating, and testing GraphQL queries.
For example, try copy-pasting and executing the following query at https://zendro.conabio.gob.mx/api/graphql, which is the API that we will be using in this and other tutorials.
{
rivers(pagination:{limit:10, offset:0}){
river_id
name
length
}
}
(The example above only gets the first 10 results. A later section of this tutorial explains how to use pagination to pull down a given number of items, or all of them, from a dataset.)
The function get_from_graphQL() defined below queries a GraphQL API and transforms the data from JSON format (the output of GraphQL) into an R data frame that you can easily use for further analyses. If you want to know what is going on inside this function, there is a detailed step-by-step description at the end of this document.
To start using get_from_graphQL(), first run the code below to load the function into your R environment (you can also keep it in a separate file and use source() to run it):
get_from_graphQL <- function(query, url){
  ### This function queries a GraphQL API and outputs the data as a single data.frame
  ## Arguments
  # query: a GraphQL query. It should work if you try it in the GraphiQL interface. Must be a character string.
  # url: URL of the server to query. Must be a character string.
  ## Needed libraries:
  # library(httr)
  # library(jsonlite)
  # library(dplyr)
  # library(stringr)

  ### Function
  ## query the server
  result <- POST(url, body = list(query = query), encode = "json")

  ## check server response
  status_code <- result$status_code
  if(status_code != 200){
    print(paste0("Oh, oh: status code ", status_code, ". Check your query and that the server is working"))
    return(invisible(NULL)) # return nothing, so that rbind() in a download loop simply skips this batch
  }

  # get data from query result
  jsonResult <- content(result, as = "text", encoding = "UTF-8")

  # check if data downloaded without errors
  # GraphQL sends an "errors" field if there is a problem with the query and the data was not downloaded properly, even if the connection status was 200.
  errors <- grepl("\"errors\"", jsonResult, fixed = TRUE)
  if(isTRUE(errors)){
    print("Sorry :(, your data downloaded with errors, check your query and API server for details")
    return(invisible(NULL))
  }

  # parse the JSON text into an R list
  readableResult <- fromJSON(jsonResult,
                             flatten = TRUE) # this TRUE is to combine the different lists into a single data frame (because data coming from different models is nested in lists)
  # get data
  data <- as.data.frame(readableResult$data[1])
  # rename colnames to original variable names
  x <- str_match(colnames(data), "\\w*$")[,1] # matches word characters (i.e. not the ".") at the end of the string
  colnames(data) <- x # assign new colnames
  return(data)
}
get_from_graphQL() allows you to get up to 1,000 elements (results of your query) at a time, which is the maximum number this API allows for a single batch. The next section explains how to use pagination to download larger datasets in batches.
To use the get_from_graphQL() function, you first have to define a GraphQL query. If you don’t know how to do this, start with the Introduction to GraphQL and querying the API of Zendro How to Guides.
Once you have a GraphQL query working, you’ll need to save it to an R object as a character vector:
my_query<- "{
rivers(pagination:{limit:10, offset:0}){
river_id
name
length
}
}
"
Next we use this query as an argument for get_from_graphQL(), along with the URL of the API, which is the same as that of the GraphiQL web interface you explored above:
data<-get_from_graphQL(query=my_query, url="https://zendro.conabio.gob.mx/api/graphql")
If all went well, you will get a data frame with the result of your query:
head(data)
The API returns the results of a query in batches of at most 1,000 elements. If the data you want to download is larger than that, you need to paginate, i.e. get the data in batches.
pagination is an argument within GraphQL queries. It can be done in two ways:
Limit-offset: indicating the first element to get (offset, default 0) and the number of elements to get (limit). The limit can’t be larger than 1000.
Cursor-based: indicating the unique ID (cursor) of the element to get first, and the number of elements to get after it.
Zendro uses limit-offset pagination with the syntax:
pagination:{limit:[integer], offset:[integer]}
See GraphQL documentation and this tutorial on GraphQL pagination for more details.
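The limit-offset syntax above can be filled in programmatically. The following is a minimal sketch in base R; the helper name build_rivers_query() is hypothetical (it is not part of Zendro or of this tutorial's function), and it assumes the rivers type and fields used in the earlier example.

```r
# Hypothetical helper: builds the rivers query for a given limit and offset.
build_rivers_query <- function(limit, offset = 0) {
  # The API used in this tutorial accepts at most 1000 elements per batch
  stopifnot(limit >= 1, limit <= 1000, offset >= 0)
  # as.integer() + %d avoids scientific notation (e.g. "1e+05") in large offsets
  sprintf(
    "{\n  rivers(pagination:{limit:%d, offset:%d}){\n    river_id\n    name\n    length\n  }\n}",
    as.integer(limit), as.integer(offset)
  )
}

# Third page when reading in batches of 10: skip the first 20 elements
third_page_query <- build_rivers_query(limit = 10, offset = 20)
cat(third_page_query)
```

The resulting string can be passed as the query argument of get_from_graphQL().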
In the previous examples we downloaded only 10 elements (pagination:{limit:10}) from the rivers type, but the dataset is larger. (Remember, data in GraphQL is organised in types and fields within those types. When thinking about your structured data, you can think of types as the names of tables, and fields as the columns of those tables. In the example above rivers is a type and its fields include river_id, name and length, among others.)
To find out how many elements a type has, we can query its count function, if one is available for the type we are interested in. You can check this in the Docs panel at the top right of the GraphiQL interface.
For example, rivers has the function countRivers, so with the query {countRivers} we can get the total number of rivers.
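To see what such a count query returns, the sketch below parses a sample response offline with jsonlite. The JSON string is written by hand for illustration (it follows the standard GraphQL response shape, with the value 50 taken from this tutorial), so no request is sent to the server.

```r
library(jsonlite)

# Hand-written sample of the JSON returned for the query {countRivers}
count_json <- '{"data": {"countRivers": 50}}'

parsed <- fromJSON(count_json)

# The count sits under data, named after the queried function
n_rivers <- parsed$data$countRivers

# Converting the first element of data to a data frame gives a
# 1 x 1 table, so the count can also be read from cell [1, 1]
count_df <- as.data.frame(parsed$data[1])
n_from_df <- count_df[1, 1]
```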
Similar to how we got data before, you can use this very simple query in the function get_from_graphQL() to get the number of rivers into R:
# query API with count function
no_records<-get_from_graphQL(query="{countRivers}", url="https://zendro.conabio.gob.mx/api/graphql")
# change to vector, we don't need a df
no_records<-no_records[1,1]
no_records
## [1] 50
In this case we have 50. Technically we could download all the data in a single batch because it is fewer than 1,000 elements, but for demonstration purposes we will download it in batches of 10.
The following code calculates the number of pages needed to get a given number of records, assuming a desired limit (size of each batch). Then it runs get_from_graphQL() within a loop, once per page, until the desired total number of records has been downloaded.
# Define the desired number of records and the limit. The number of pages and the offsets are calculated from them
no_records <- no_records # this was obtained above with a count query, but you can also set it manually to a custom number
my_limit <- 10 # max 1000.
no_pages <- ceiling(no_records/my_limit)

## Define offsets.
# One offset per page, starting at 0: 0, my_limit, 2*my_limit, ...
my_offset <- seq(from = 0, by = my_limit, length.out = no_pages)
# Or you can define the offsets manually
# by uncommenting the following line
# and commenting out the line above:
# my_offset<-c(#manually define your vector)

## Create the object where the downloaded data will be stored. Leave empty
data <- data.frame()

##
## Loop to download the data from GraphQL using pagination
##
for(i in seq_along(my_offset)){
  # Define pagination
  pagination <- paste0("limit:", my_limit, ", offset:", my_offset[i])
  # Define query looping through desired pagination:
  my_query <- paste0("{
  rivers(pagination:{", pagination, "}){
    river_id
    name
    length
  }
}
")
  # Get data and add it to the already created df
  data <- rbind(data, get_from_graphQL(query=my_query, url="https://zendro.conabio.gob.mx/api/graphql"))
  # end of loop
}
As a result you will get all the data in a single df:
head(data)
summary(data)
## river_id name length
## Length:50 Length:50 Min. : 65.0
## Class :character Class :character 1st Qu.: 150.0
## Mode :character Mode :character Median : 283.0
## Mean : 347.1
## 3rd Qu.: 402.5
## Max. :1521.0
## NA's :6
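The batching logic can also be tried without a server. The sketch below is an illustration under stated assumptions, not part of the tutorial's code: download_in_batches() and fake_fetch() are hypothetical names, and fake_fetch() stands in for a call to get_from_graphQL() by slicing a small local data frame the way limit and offset would.

```r
# Hypothetical helper: fetches n_records in pages of size `limit`.
# fetch_page must be a function(limit, offset) that returns a data frame.
download_in_batches <- function(n_records, limit, fetch_page) {
  n_pages <- ceiling(n_records / limit)
  # One offset per page: 0, limit, 2 * limit, ...
  offsets <- seq(from = 0, by = limit, length.out = n_pages)
  pages <- lapply(offsets, function(off) fetch_page(limit = limit, offset = off))
  do.call(rbind, pages)
}

# Local stand-in for the API: 25 made-up records
fake_rivers <- data.frame(
  river_id = as.character(1:25),
  length = seq(100, by = 10, length.out = 25),
  stringsAsFactors = FALSE
)

# Mimics limit-offset pagination on the local data frame
fake_fetch <- function(limit, offset) {
  first <- offset + 1
  last <- min(offset + limit, nrow(fake_rivers))
  if (first > last) return(fake_rivers[0, ])
  fake_rivers[first:last, ]
}

# 25 records in batches of 10 means 3 pages with offsets 0, 10 and 20
all_rivers <- download_in_batches(n_records = 25, limit = 10, fetch_page = fake_fetch)
```

To use it against the real API, fetch_page would wrap get_from_graphQL() with a query built from the limit and offset it receives.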
get_from_graphQL() explained step by step
The following is a step-by-step example explaining in more detail how the function get_from_graphQL() that we used above works.
First, once you have a GraphQL query working, you’ll need to save it to an R object as a character vector:
my_query<- "{
rivers(pagination:{limit:10, offset:0}){
river_id
name
length
}
}
"
Next, define the URL of the API as another character vector. It is the same as that of the GraphiQL web interface you explored above:
url<-"https://zendro.conabio.gob.mx/api/graphql"
Now we can send a query to the API by using a POST request:
# query server
result <- POST(url, body = list(query=my_query), encode = c("json"))
The result that we get is the HTTP response. Before checking if we got the data, it is good practice to verify that the connection was successful by checking the status code. A 200 means that all went well. Any other code means there was a problem.
# check server response
result$status_code
## [1] 200
We now need to extract the data in order to be able to manipulate it. If everything went well, the HTTP response will contain an attribute data, which will itself contain an attribute named after the query, in this case rivers.
result
## Response [https://zendro.conabio.gob.mx/api/graphql]
## Date: 2022-07-27 23:15
## Status: 200
## Content-Type: application/json; charset=utf-8
## Size: 983 B
## {
## "data": {
## "rivers": [
## {
## "river_id": "1",
## "name": "Acaponeta",
## "length": 233
## },
## {
## "river_id": "10",
## ...
If the query is not written properly or if there is any other error, the attribute data won’t exist and instead we will get the attribute errors listing the errors found.
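One way to check for the errors attribute is to parse the response and look for that field directly. This is a minimal sketch with hand-written sample responses; the helper name graphql_errors() and the error message text are hypothetical, and it assumes the standard GraphQL convention that each entry of errors carries a message.

```r
library(jsonlite)

# Hypothetical helper: returns the error messages found in a GraphQL
# JSON response, or an empty character vector if there are none.
graphql_errors <- function(json_text) {
  parsed <- fromJSON(json_text, simplifyVector = FALSE)
  errs <- parsed[["errors"]]
  if (is.null(errs)) return(character(0))
  vapply(errs, function(e) e[["message"]], character(1))
}

# Hand-written samples: one successful response, one failed response
ok_json <- '{"data": {"rivers": [{"river_id": "1", "name": "Acaponeta", "length": 233}]}}'
bad_json <- '{"errors": [{"message": "Hypothetical error message"}]}'

ok_messages <- graphql_errors(ok_json)
bad_messages <- graphql_errors(bad_json)
```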
If all went well, we can proceed to extract the content of the results with:
# get data from query result
jsonResult <- content(result, as = "text")
The result will be in JSON format, which we can convert into an R object (a list). In this list the results are grouped under each type used in the query. The argument flatten is used to collapse the data from different types into a single data frame.
# parse the JSON text into an R list
readableResult <- fromJSON(jsonResult,
                           flatten = TRUE)
Extract data:
# get data
data<-as.data.frame(readableResult$data[1])
head(data)
By default, the name of each type is added at the beginning of each column name:
colnames(data)
## [1] "rivers.river_id" "rivers.name" "rivers.length"
To keep only the name of the variable as it is in the original data:
x <- str_match(colnames(data), "\\w*$")[,1] # matches word characters (i.e. not the ".") at the end of the string
colnames(data) <- x # assign new colnames
So finally we have the data in a single nice looking data frame:
head(data)
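The renaming step can also be done without stringr. The sketch below uses base R's sub() to drop everything up to the last dot; it is an alternative shown for illustration, not the code used inside get_from_graphQL(), and the column names are the ones printed earlier in this section.

```r
# Column names as returned before renaming
prefixed_names <- c("rivers.river_id", "rivers.name", "rivers.length")

# "^.*\\." greedily matches everything up to and including the last dot,
# so only the original field name is left
clean_names <- sub("^.*\\.", "", prefixed_names)
clean_names
```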
Notice that you will get a data frame like the one above only for one-to-one associations. In other cases some variables will still be lists, which you can process in a separate step.
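As an illustration of that separate step, the sketch below parses a hand-written response in which each river has several associated records. The cities field and its values are hypothetical (they are not taken from the Zendro API); the point is only to show what a list column looks like and one way to unnest it.

```r
library(jsonlite)

# Hand-written sample: "cities" is a hypothetical one-to-many association
nested_json <- '{"data": {"rivers": [
  {"river_id": "1", "name": "Acaponeta",
   "cities": [{"name": "City A"}, {"name": "City B"}]},
  {"river_id": "2", "name": "Other river",
   "cities": [{"name": "City C"}]}
]}}'

rivers <- fromJSON(nested_json, flatten = TRUE)$data$rivers

# The association stays as a list column: one small data frame per river
is.list(rivers$cities)

# Unnest into one row per river-city pair
# (assumes every river has at least one city, as in this sample)
city_rows <- do.call(rbind, lapply(seq_len(nrow(rivers)), function(i) {
  data.frame(
    river_id = rivers$river_id[i],
    city_name = rivers$cities[[i]]$name,
    stringsAsFactors = FALSE
  )
}))
city_rows
```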