Summative Portfolio: L4

Author

Steve Crawshaw

Published

April 26, 2023

1 Introduction

My name is Steve Crawshaw. I am a project manager for Bristol City Council and I have undertaken the L4 Apprenticeship (Data Analyst) to improve my knowledge and skill in data analytics with a view to undertaking the L7 apprenticeship and achieving a role as a data scientist.

I have worked for Bristol City Council (BCC) since 1998 in essentially the same role, although I was seconded to another organisation between 2013 and 2016. My main role now is managing a network of air quality monitors and the data that they generate. This will be the subject on which I will focus for the summative portfolio.

2 Employer: Bristol City Council

Bristol City Council is a large unitary local authority in the South West of England.

2.1 Goals, Vision and Values

Bristol City Council’s Corporate Strategy outlines a vision of driving an inclusive, sustainable and healthy city of hope and aspiration where everyone can share the city’s success. It also describes the activities required by law.

The Corporate Strategy’s main priorities are informed by 5 key principles.

  • Development and delivery
  • Environmental sustainability
  • Equality and inclusion
  • Resilience
  • World-class employment

It’s also arranged around 7 main themes:

  • Children and young people: A city where every child belongs and every child gets the best start in life, whatever circumstances they were born into.

  • Economy and skills: Economic growth that builds inclusive and resilient communities, decarbonises the city and offers equity of opportunity.

  • Environment and sustainability: Decarbonise the city, support the recovery of nature and lead a just transition to a low-carbon future.

  • Health, care and wellbeing: Tackle health inequalities to help people stay healthier and happier throughout their lives.

  • Homes and communities: Healthy, resilient, and inclusive neighbourhoods with fair access to decent, affordable homes.

  • Transport and connectivity: A more efficient, sustainable, and inclusive connection of people to people, people to jobs and people to opportunity.

  • Effective development organisation: From city government to city governance: creating a focussed council that empowers individuals, communities, and partners to flourish and lead.

Bristol City Council is currently a mayoral-led authority. The current Mayor, Marvin Rees, has set out values for the organisation, as shown below.

Bristol City Council’s values

2.2 How BCC Uses Data

BCC is a large and complex organisation dealing with a wide range of functions, from managing highways and planning applications to looking after vulnerable people. It follows that multiple systems and approaches exist for managing data across the organisation, many of which have evolved over time and have not been centrally planned or managed.

A recently published “Data, Insight and Information Strategy” sets out the strategic direction and objectives for the council in this area as follows:

  • Objective 1 - Create the right culture and environment for great collaboration amongst teams and partners

  • Objective 2 - Become an insight-rich, data-driven organisation to improve performance

  • Objective 3 - Manage our information safely, securely and appropriately

  • Objective 4 - Create a more efficient and effective approach to the use of data

  • Objective 5 - Establish the supply and use of data as a key enabler of a more connected, smarter city

The strategy aims to deliver these objectives by developing a single analytics team to deliver insights across the organisation and by consolidating the disparate and distributed datasets across the council into a corporate data lake with analysis being done on one data analytics platform (Power BI). Several projects are under way to deliver this change. (K2)

2.3 Data Architecture

No specific document exists describing BCC’s data architecture; however, there is an information asset taxonomy, which is summarised in the diagram below. Only the “people” element is shown. The taxonomy was developed in 2017 by an officer who has since left BCC, so the information may be somewhat outdated.

Information Asset Taxonomy

Much of this data architecture is not of direct relevance to my work as it pertains to other departments and functions. In my own work area, the main data architecture which I design, maintain and operate is summarised as follows:

  • 182 passive samplers (Diffusion tubes) providing monthly concentrations of pollutants (NO2)
  • MS Access database (K6) to hold diffusion tube and monitoring network meta data
  • A network of 8 real time air quality monitors (analysers)
  • A fleet of 4G routers providing telemetry to connect these devices to:
  • A proprietary communications software suite (Envista Commcentre)
  • A database client and analytics program (Envista ARM) (K6)
  • A SQL server database hosted on Azure (K6, S9)
  • An open data (OD) platform provided by Opendatasoft (K4)
  • FME and FME Server processes to Extract, Transform and Load (ETL) air quality data to the open data platform (K10)
  • Dashboards, visualisation and analytics delivered through the OD platform (S14)
  • Bespoke reporting and ETL pipelines delivered through R, sourcing data through the OD portal and the Envista database (S14, S10, S2, K14, K7)

2.4 Data Structures

The data used in this portfolio are all structured data, owing to the nature of the work. Examples of this structured data are tabular data extracted from the corporate SQL Server database or the MS Access database, and data processed as JSON arrays on the open data portal.

Some data used in the dashboard project could be considered unstructured, for example the collection of images used to create the animated GIF. However, there is an inherent structure to the collection, in that the series of images is labelled with a sequential time stamp and relates to a single monitor. These data are really an intermediate step in a process to create a visualisation rather than a source data set.

In the conventional sense, and considering the definition of unstructured data when applied to source data for analysis, unstructured data such as collections of images, multimedia files or emails were not used in the project as they are not relevant in this domain. (K5)

2.5 Security Standards and Policies

The overarching Information Governance Framework outlines roles and responsibilities, policies and procedures, along with best practice and standards for managing the Council’s information assets. This has been developed to take account of the standards set by external organisations, such as the NHS in respect of the transition of Public Health to the Council and the requirements of the Public Sector Network (PSN) Code of Connection (CoCo).

The framework consists of the following areas:

  1. IGF Principles
  2. The Information Governance Strategy
  3. Appropriate Information Governance Responsibilities
  4. Information Asset Ownership
  5. An Information Governance Structure
  6. Effective Information Governance policies and procedures
  7. An Information Asset Register
  8. An Information Risk Register
  9. Information Governance communication and training

There are a number of policies and procedures in the framework which deliver the outcomes of the Information Governance Strategy, including:

  1. Instant Messaging Policy
  2. Acceptable Use Policy
  3. Training, Awareness and Development Procedure
  4. Agile or Teleworking Policy
  5. Logical Access Control Policy
  6. Physical Access Control Policy
  7. Information Security Incident Reporting Policy
  8. Subject Access Request Policy

All the policies and procedures are hosted on a “metacompliance” platform, which manages access and control of the policies and ensures that relevant staff have read and agreed to the policies.

All my work is compliant with the IGF and I liaise with colleagues in IG where necessary. Because much of my work is with environmental data, the risk of infringing the IGF through personal data issues is limited. (K2)

3 My Role: Air Quality Project Manager

My official job title is “Project Manager”. This is currently under review, partly because the extent of actual project management activity is quite limited. The majority of my time is spent managing a network of air quality monitors and the data that arises from the network.

3.1 Key Deliverables

The deliverables for which I am responsible are summarised as follows:

  1. Data capture rates exceeding 85% for continuous monitoring, 75% for passive.
  2. Monthly data cleaning (ratification) of continuous data. (K3, K4, K8)
  3. Annual reporting of air quality data. (S12, S13, S14, S15)
  4. Calculation of Key Performance Indicators (KPIs) for air quality.
  5. Developing or revising KPIs as necessary. (K11, K13)
  6. Ad hoc analysis and summaries of air quality data to support other programmes.
  7. Ensuring all relevant air quality data are published on the Open Data Portal. (K4, S12)
  8. Ensuring all relevant technical guidance is followed in relation to air quality management. (S1, K1)
  9. Responding to relevant requests from stakeholders for air quality data and analysis. (S7)
  10. Delivery of specific projects such as “Slow the Smoke” (S12, S13, S14, S15)
  11. Responding to planning applications with an air quality dimension.

3.2 Key Skills and Knowledge

The key skills and knowledge for my role are as follows:

3.2.1 Knowledge

  • A good understanding of the legal framework for air quality management in the UK. (K1)
  • Familiarity with the relevant technical guidance on assessing air quality. (K2)
  • Knowledge of air quality policy and interactions with other domains like transport.
  • An understanding of the development management (land use planning) process in the UK including Environmental Impact Assessment (EIA).
  • An understanding of the principles of open data and legal framework for public access to data. (K4)

3.2.2 Skills

  • Processing and analysing medium-sized (up to 10 million observations) data sets
  • SQL (SQL Server, ODSSQL and MS Access) (K6)
  • Excel
  • HTML, JavaScript and CSS for web development of open data products
  • R: the packages openair, openaq, sf, timetk, fastverse, deweather and tidyverse are relevant to the air quality domain
  • FME and FME Server for automating web services and data integration
  • Time series analysis
  • Network telemetry: IP, analogue, wireless, Teltonika RMS
  • Technical report writing and comprehension of technical reports relating to the domain
  • Communication skills - ability to report technical information to non-specialists
  • Project management for small and medium-sized projects
  • Technical skills related to installation, maintenance and quality control of air monitoring instruments
  • Negotiation skills for contract management and securing outcomes in the planning process

3.3 Strengths and Weaknesses

3.3.1 Strengths

  • Long experience in the domain and a good level of skills and knowledge
  • Strong work ethic
  • Motivation to deliver and improve services and air quality
  • Wide range of contacts internally and externally
  • Collaborative approach to working

3.3.2 Weaknesses

  • Wide, rather than deep data skill set
  • Limited understanding of statistical theory and advanced analysis
  • No exposure to a team environment with other data analysts, so lacking peer support
  • Isolated from corporate data analytics functions
  • Limited employer incentive to improve skills

3.4 Areas for Improvement

  • Improved understanding of statistical theory and learning
  • Power BI (corporate data analytics tool)
  • Azure (corporate data platform)
  • Python and PANDAS for comprehensive tooling of data science operations

4 Portfolio Projects

I have identified three projects which align with business and apprenticeship progress review objectives. They are focussed on creating data products which will enhance business processes and increase confidence in our data and analysis.

My four objectives for the apprenticeship are shown below. The three portfolio projects deliver these objectives.

Apprenticeship objectives 1 to 4 (images)

I am able to work on these projects concurrently and I believe they are all achievable by the end of the apprenticeship.

I have created a GitHub repository for all the projects, and other work as part of the apprenticeship. This enables code sharing, version control and backup of work. I have found this very useful in being able to work on projects flexibly.

All the projects were implemented mainly in R. I selected R as the overarching computing environment because it is the language in which I have the most programming experience. Another key consideration is that R offers a powerful and comprehensive package for the manipulation of air quality data, openair, which is integral to several of the projects; similar functionality is not available in other languages.

4.1 Combining Data in Projects

All the projects require data sets to be combined. The first project requires multiple sources of data to be joined with each other. This is usually achieved using the dplyr::left_join() or dplyr::inner_join() functions, but the purrr::map_dfr() function is also used extensively to iteratively stack multiple data sets extracted from databases, or to combine list structures within tibbles. The dashboard project requires meteorological data from the open data portal to be combined with air quality data to generate visualisations such as polar plots. Data are combined by joining on hourly time stamps with dplyr join functions.

Because the sources of data used in the project are mostly under my control, the risks of combining are minimal as I understand the data characteristics and processes for sourcing. Selecting the appropriate field to join on and type of join is important to ensure that the resulting data is as expected. (K10)
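To illustrate the join pattern described above, a minimal, self-contained sketch is shown below; the tibbles and column names are illustrative, not the actual schema.

```r
library(dplyr)
library(purrr)

# Toy per-site extracts and hourly met data (illustrative columns only)
aq_sites <- list(
  tibble(site_id = "215",
         date_time = as.POSIXct("2022-01-01 00:00", tz = "UTC") + 3600 * 0:2,
         no2 = c(21, 25, 19)),
  tibble(site_id = "500",
         date_time = as.POSIXct("2022-01-01 00:00", tz = "UTC") + 3600 * 0:2,
         no2 = c(35, 41, 38))
)
met <- tibble(date_time = as.POSIXct("2022-01-01 00:00", tz = "UTC") + 3600 * 0:2,
              ws = c(2.1, 3.4, 4.0), wd = c(180, 200, 210))

# Stack the per-site extracts into one long tibble, then join the met data
# on the shared hourly timestamp
aq_met <- map_dfr(aq_sites, ~ .x) |>
  left_join(met, by = "date_time")
```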

4.2 Project 1: A Data Processing Pipeline for the Statutory Reporting of Air Quality Data

4.2.1 Introduction

Bristol City Council monitors air quality across the city and reports data to government under the Local Air Quality Management (LAQM) regime. The Environment Act (1995) places a statutory duty on councils to manage air quality and to ensure that the air meets legal limits for regulated pollutants.

In addition to this duty, there is also a requirement to provide monitoring and evaluation of the operation of a Clean Air Zone (CAZ) in Bristol. The CAZ is the main mechanism by which traffic pollution (nitrogen dioxide or NO2) will be reduced to comply with the legal limits.

The two reporting regimes are similar but differ in some respects. For LAQM, the reporting is directly to the government and takes the form of an Annual Status Report (ASR) to be submitted by June 30th of each year. The ASR contains summary tables of air quality data from all our monitoring, which are required in a format determined by a template document. For CAZ reporting, BCC is required to submit a spreadsheet of summary data in a set format to the Institute for Transport Studies (ITS) at Leeds, which conducts research on all the CAZs in England on behalf of the government. (K1, S1)

The aim of this project is to develop a process for reporting for both of these purposes that limits the need for manual data entry.

4.2.2 The LAQM regime

Local authorities are required to review and assess air quality in their areas and report to government annually. Where concentrations of regulated pollutants exceed the legal limits, known as “Air Quality Objectives” (AQOs), councils must declare an Air Quality Management Area (AQMA) and publish a plan for dealing with the breaches of the AQOs.

Several urban local authorities experience breaches of the AQO for nitrogen dioxide, a traffic pollutant. The AQO is set at 40 \(\mu g\,m^{-3}\), measured as an annual mean. In Bristol, this is the only pollutant for which exceedances are measured.

The relevant guidance for LAQM, LAQM.TG(22), prescribes the monitoring regime that must be followed for each pollutant, including how data are monitored, processed and reported. The process used in this project adheres to the requirements of LAQM.TG(22). (K1, S1)

Further prescriptive guidance is provided by Defra for air quality reporting in the form of templates for the report itself and for the tables that should be included in the report. (K2)

For this project, the data process will focus only on tables prefixed “A” in the spreadsheet referenced. These are the tables that contain monitoring data. Other tables in the report require manual data entry, as they need data sourced from other officers and departments in the council, so they will not be addressed in this pipeline.

In addition to these tables, there is a requirement to include thematic maps in the report which show concentrations of pollutants at monitoring sites. The process will also produce spatial datasets which can be used in a GIS to plot the maps or will produce graphical output directly for inclusion in the report as image files.

4.2.3 CAZ Reporting

The purpose of the CAZ is to reduce concentrations of one pollutant (NO2) to legal levels in the shortest time possible, and the reporting is therefore limited to NO2. However, the reporting on this single pollutant requires greater granularity than the LAQM reporting. This is partly because advanced techniques for analysing changes in time series air quality data will be used to assess the impact of a specific policy intervention (the CAZ), and therefore the hourly time series data need to be reported, not just the summary statistics. This analysis will be done by ITS using the AQEval R package, and the data for Bristol will be supplied by providing access to the open data portal dataset for continuous air quality data. (K11)

The reporting for diffusion tube data consists of providing data quarterly in a pre-formatted template. The data are a combination of metadata for the measurement site (e.g. location, height, distance from a road) and the “raw” monthly concentrations from the diffusion tube. The measurement techniques for diffusion tubes and continuous analysers are described later.

4.2.4 Analysis and Processing Techniques

Summary

The overall process was controlled using the targets R package. This is a Make-like pipeline tool for statistics and data science in R that helps maintain a reproducible workflow. targets learns how the pipeline fits together, skips costly runtime for tasks that are already up to date, runs only the necessary computation, supports implicit parallel computing, abstracts files as R objects, and provides tangible evidence that the results match the underlying code and data.

Each operation was assigned to a function. targets ensures that the functions run in the right order, and only when all preconditions are satisfied. (K3, K8, K10, K14, S2, S4, S15)

Data Sources

The data needed for the analysis are in five separate repositories:

  1. A corporate Azure SQL server database (envista) hosting BCC’s continuous air quality data
  2. A Microsoft Access database for the diffusion tube data and monitoring site meta data
  3. The council’s open data portal
  4. UK Government’s Air Information Resource, which holds data for the air monitoring sites run by government as part of the Automatic Urban and Rural Network (AURN). There are two of these sites in Bristol.
  5. The LAQM Portal (holds data for the calendar for changing diffusion tubes)

Four of these repositories were used in the analysis. The open data portal was not used because the contract with the provider is planned to end in 2023. This is somewhat suboptimal, as the open data portal contains all the relevant data needed for the analysis and the data are easily accessed using a powerful and well documented REST API. The project was reconfigured to avoid using the open data portal when I became aware that it was to be discontinued. (B7, B6, K10, S9)

Data processing overview

The data processing is split into two parts, because the reporting regime requires the use of a standard Excel spreadsheet template. The first stage generates the input data for the template using a targets pipeline. The second stage takes output from the template spreadsheet and generates further files for inclusion in the final report, including charts as images, csv files for updating the open data portal and ESRI shapefiles for mapping in ArcGIS.

Data from databases were accessed using a combination of the DBI, dplyr and odbc packages (K6). Data from the LAQM Portal were retrieved and processed with the rvest package for web scraping. For the data hosted on the government’s website, the openair package provides a helpful function called importAURN. Functions written using a combination of tidyverse and base R, and controlled by targets, implement the ETL processes for stage 1. The dataframes are exported as csv files ready to be pasted directly into the spreadsheet template, or as an Excel spreadsheet with individual worksheets corresponding to the specific tables.

For the generation of spatial output, the sf package is used to create shapefiles for use in ArcGIS and output graphics as png files.

Data Processing Approach

All custom functions are contained within the functions.R file which is called by _targets.R. The dates of interest are contained at the top of the file and are set by the user. These are beginning and end of the calendar year for reporting.

Diffusion Tube Calendar Table

Diffusion tubes are passive samplers that measure air quality (NO2) for a defined period. The period is determined by a calendar published on the internet, and this must be adhered to so that data are consistent. The dates for the calendar are extracted using the web scraping library rvest in a two-stage process. Firstly, a function scrapes the two HTML tables from the Defra website.

Scrape HTML tables into list

Secondly, the relevant dates for the year of interest are selected, cleaned and output as a tibble.

Extract the year’s diffusion tube changeover dates
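Since the functions themselves appear as screenshots, the sketch below illustrates the two-stage rvest pattern under stated assumptions: the URL is supplied by the caller, and the column handling is illustrative rather than the production code.

```r
library(rvest)
library(dplyr)

# Stage 1: scrape every HTML table on the calendar page into a list of tibbles
scrape_calendar_tables <- function(calendar_url) {
  # calendar_url: the Defra/LAQM diffusion tube calendar page
  read_html(calendar_url) |>
    html_elements("table") |>
    html_table()
}

# Stage 2: keep only the rows for the year of interest and return a clean tibble
extract_changeover_dates <- function(tables, year) {
  tables[[1]] |>
    mutate(across(everything(), as.character)) |>
    filter(if_any(everything(), ~ grepl(year, .x)))
}
```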

Several utility functions are used throughout the pipeline in other functions. They are not reproduced here, but serve to clean data, provide consistent themes for plots and process dates.

Database Connections and Security

Connections are made to corporate databases to retrieve data. To maintain security and comply with organisational data processing protocols, the connection details are not embedded in code. This would be a security risk if, for example, code were shared on a public GitHub repo. The code below illustrates the use of the config package to hold connection credentials in a config.yml file, which is referenced by the function and added to a .gitignore file so that it is not published online. (K2, K6, S1)

Database connection function
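A minimal sketch of this pattern is shown below, assuming a config.yml entry named “envista” holding server, database and login values; the key and field names are illustrative.

```r
library(DBI)
library(odbc)

# Credentials live in config.yml (listed in .gitignore), never in the code itself
connect_envista <- function(config_key = "envista") {
  creds <- config::get(config_key)
  dbConnect(
    odbc(),
    Driver   = creds$driver,
    Server   = creds$server,
    Database = creds$database,
    UID      = creds$uid,
    PWD      = creds$pwd
  )
}
```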

Retrieve Continuous Data

Continuous hourly pollution concentration data from all current sites are retrieved from the envista database using the function below. The function takes the start and end dates supplied and a vector of the site IDs, which are supplied in the targets pipeline. The time base of interest for this process is hourly, but the function can be used to select a different time base, for example 15 minutes. Another tibble, final_tbl, is also supplied. This holds metadata about the monitoring sites, for example which channel (column) relates to which pollutant. The function iterates over a list of monitoring sites, collecting the data for the relevant time period, cleaning it, and returning a long-format tibble with the consolidated data. (S4, S2)

Retrieve continuous data
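The sketch below shows the iteration pattern under stated assumptions: one database table per site keyed on a timestamp, with table, column and metadata names that are illustrative rather than the actual schema.

```r
library(DBI)
library(dplyr)
library(purrr)
library(tidyr)

get_continuous_data <- function(con, site_ids, start_date, end_date, final_tbl) {
  # Query each site's table for the reporting period, then stack the results
  map_dfr(site_ids, function(id) {
    tbl(con, paste0("TB_S", id)) |>             # per-site table name is illustrative
      filter(between(Date_Time, !!start_date, !!end_date)) |>
      collect() |>
      mutate(site_id = id)
  }) |>
    left_join(final_tbl, by = "site_id") |>      # attach channel-to-pollutant metadata
    pivot_longer(cols = starts_with("Value"),    # long format: one row per reading
                 names_to = "channel", values_to = "concentration")
}
```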

Data from National Air Monitoring Network

In addition to continuous data from sites operated by Bristol City Council, we also use data from monitoring sites operated by UK government. These are known as AURN (Automatic Urban and Rural Network) sites. The openair package is used to retrieve data from these sites using the function shown below. The code first tests whether the data are available via the test.import.aurn() function. If so, it retrieves the data using the importAURN() function. If not, it uses an alternative source, importing by csv in another function (omitted for brevity).

Retrieve AURN data
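In outline, the retrieval step looks something like the sketch below; the site code and the tryCatch() fallback stand in for the actual site list and the test.import.aurn() check.

```r
library(openair)

# Illustrative AURN download for the reporting year; "BRS8" (Bristol St Paul's)
# is shown as an example site code only
aurn_data <- tryCatch(
  importAURN(site = "BRS8", year = 2022),
  error = function(e) NULL   # fall back to the csv import route if the download fails
)
```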

In addition to local monitoring data from government sites, it is also necessary to get background data from sites within 50 miles, which are used to annualise incomplete data from sites where data capture is less than 75%. The function below does that, using the importMeta() function to filter sites by latitude and longitude in proximity to Bristol and returning data from these sites as a consolidated tibble. (S6)

Continuous nearby background data
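A sketch of that selection is shown below, using a simple latitude/longitude box around Bristol; the coordinates and box size are illustrative approximations of the 50 mile radius.

```r
library(openair)
library(dplyr)

bristol_lat <- 51.45
bristol_lon <- -2.59

# Background AURN sites broadly within 50 miles (~80 km) of Bristol
nearby_background <- importMeta(source = "aurn") |>
  filter(grepl("Background", site_type),
         abs(latitude - bristol_lat) < 0.75,
         abs(longitude - bristol_lon) < 1.2)

background_data <- importAURN(site = nearby_background$code, year = 2022)
```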

Reshaping Diffusion Tube Data

After diffusion tube data are retrieved (code not shown for brevity) the data are reshaped for later insertion into the output spreadsheet.

During development of this pipeline, a mistake was made in the dates on which diffusion tubes were placed, i.e. they were not placed in accordance with the calendar. I sought guidance from the LAQM Helpdesk (a resource to assist local authorities in managing air quality). They confirmed that we could publish the data but that the incorrectly exposed concentration periods could not be used in the calculations for annual mean values. (B7)

This required adaptation of the code below and of the MS Access database, such that a new field was added to the table to flag whether the data could be used in the analysis. The code below implements this, as noted in the comment. (B5, B6)

Pivot and filter diffusion tube data

As a result of this error, I have instructed the member of staff responsible to review the diffusion tube calendar regularly and to plan changeover dates accordingly.

Annualising Incomplete Data

When there is less than 75% data capture for a monitor, the data are adjusted using the background data collected above. To do this it is necessary to know the “background” concentration of NO2 at the site. This is given in terms of a 1 km grid square identifier in the MS Access database. The identifier can then be used to derive a concentration of NO2 in that 1 km square for a given year. The function below receives the database connection, the start date and a vector of site IDs which need annualising. It gets the subset of site IDs which need to be, and can be, annualised and joins tables to return a table of background concentrations for the sites. The annualisation itself is carried out in the pre-formatted reporting spreadsheet. (K10, S4)

Background concentrations by grid square
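The join itself can be sketched as below; the table and column names are assumptions standing in for the actual MS Access schema.

```r
library(DBI)
library(dplyr)

get_background_for_sites <- function(con, site_ids, report_year) {
  # Sites needing annualisation and their 1 km grid square identifier
  sites_tbl <- tbl(con, "tube_sites") |>
    filter(site_id %in% !!site_ids) |>
    select(site_id, grid_square_id)

  # Modelled background NO2 for each grid square in the reporting year
  tbl(con, "background_no2") |>
    filter(bg_year == !!report_year) |>
    inner_join(sites_tbl, by = "grid_square_id") |>
    select(site_id, grid_square_id, background_no2) |>
    collect()
}
```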

Making Tables for the Reporting Spreadsheet

The reporting spreadsheet for the ASR requires data to be entered in a prescriptive format. The tables are numbered, e.g. “Table A1” to “Table A9”. Not all the code to create the tables is shown, for brevity, but an example is given below. This is quite a simple example: it populates a table of metadata for monitoring sites and includes a function to generate the required character string describing the monitoring technology used.

Making a reporting table

Most of the output tables require a descriptive statistical summary in order to compare the measured concentrations with the relevant air quality objective. Examples of air quality objectives are shown below. They are usually an annual mean value or a number of exceedances of a threshold value per year, sometimes expressed as a percentile. These statistics are calculated in the R functions within the pipeline. (K13, S10, K14, S11)

Air quality objectives (extract)
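To make the statistics concrete, here is a self-contained sketch of the kind of summary calculated in the pipeline for hourly NO2: the annual mean, the count of hours above 200 \(\mu g\,m^{-3}\) and its percentile equivalent (the 99.79th percentile corresponds to the 18 permitted exceedances per year). The input tibble is simulated.

```r
library(dplyr)

# Simulated hourly NO2 for two sites (one calendar year each)
set.seed(1)
hourly_no2 <- tibble(
  site_id = rep(c("215", "500"), each = 8760),
  no2     = pmax(rnorm(2 * 8760, mean = 28, sd = 12), 0)
)

no2_summary <- hourly_no2 |>
  group_by(site_id) |>
  summarise(
    annual_mean      = mean(no2, na.rm = TRUE),
    hours_above_200  = sum(no2 > 200, na.rm = TRUE),
    percentile_99_79 = quantile(no2, probs = 0.9979, na.rm = TRUE),
    data_capture_pct = 100 * mean(!is.na(no2))
  )
```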

List of Reporting Tables

All the reporting tables are added to a list in the function below, which takes an unspecified number of tibbles, adds them to a list, and removes the “NA”s generated when the contents are turned into character data types. This is so that the reporting spreadsheet does not contain multiple “NA”s.

Make list of tables

Write Spreadsheet

The list of tables generated in the function above (and others) is written to the final reporting spreadsheet using the function below. This uses the write_xlsx function to write each item in the list supplied to a separate worksheet in the spreadsheet.

Write Spreadsheet
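Taken together, those two steps can be sketched as follows; the function body and the example table names (table_a1, table_a2) illustrate the pattern rather than the production code.

```r
library(dplyr)
library(purrr)
library(writexl)

# Coerce every table to character and blank out NAs, keeping the list names
# so that each table becomes a named worksheet
make_table_list <- function(...) {
  map(
    list(...),
    ~ .x |>
      mutate(across(everything(), as.character)) |>
      mutate(across(everything(), ~ if_else(is.na(.x), "", .x)))
  )
}

# e.g.
# asr_tables <- make_table_list(table_a1 = site_meta_tbl, table_a2 = annual_means_tbl)
# write_xlsx(asr_tables, path = "asr_output_tables.xlsx")
```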

Graphical Output

In addition to the data required in the reporting spreadsheet for LAQM, the ASR also makes use of charts to show, for example, trends in pollutants. The code below is included in the pipeline and iteratively generates charts for predetermined locations in Bristol. make.no2.trend.chart.tbl() takes tables of the annual concentration data for current and previous years, a tibble of the areas to be plotted and the monitors in each area, and the monitor site metadata tibble. It generates a plot for each area using ggplot, mapping over the areas. Each plot is a nested object in the output table, which also includes a column with the file name of the output image.

The function at the bottom of the screenshot takes that tibble and uses the purrr::pwalk() function to iteratively save each plot as a graphics file in the path provided.

Plotting trend charts
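The nested-plot pattern can be sketched as below with simulated data; the area names, columns and file paths are illustrative, and the actual function carries more styling and annotation.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

# Simulated annual means for two areas, two monitors each, four years
annual_no2_tbl <- tibble(
  area        = rep(c("Area 1", "Area 2"), each = 8),
  site_id     = rep(rep(c("A", "B"), each = 4), times = 2),
  year        = rep(2019:2022, times = 4),
  annual_mean = runif(16, 25, 45)
)

trend_plots <- annual_no2_tbl |>
  group_by(area) |>
  nest() |>
  mutate(
    plot = map2(data, area, ~ ggplot(.x, aes(year, annual_mean, colour = site_id)) +
                  geom_line() +
                  geom_hline(yintercept = 40, linetype = "dashed") +  # NO2 annual mean objective
                  labs(title = .y, y = "NO2 annual mean (ug/m3)")),
    file = paste0("plots/no2_trend_", gsub(" ", "_", area), ".png")
  )

# Save each nested plot to its file path
dir.create("plots", showWarnings = FALSE)
pwalk(list(trend_plots$file, trend_plots$plot),
      ~ ggsave(filename = .x, plot = .y, width = 8, height = 5))
```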

An example of a trend chart for an area is shown below. (S14)

NO2 trend chart

Spatial Data Output

Because there is a strong spatial element to air quality data, the ASR contains several maps showing how air quality varies across the city. The data processed in the pipeline are output as a shapefile to support mapping in ArcGIS software. R could be used for mapping, but generally a higher quality mapping product is more easily produced using GIS. However, the source data can easily be generated using the function below.

Write shape file
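A minimal sketch of that export, assuming the summary table carries British National Grid easting and northing columns (names illustrative):

```r
library(sf)

write_aq_shapefile <- function(summary_tbl, out_path = "output/no2_annual_means.shp") {
  summary_tbl |>
    st_as_sf(coords = c("easting", "northing"), crs = 27700) |>  # EPSG:27700, British National Grid
    st_write(out_path, delete_layer = TRUE)
}
```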

The targets Pipeline

All the functions shown (and more) are called and run in a pipeline governed by the targets package. This implements a “Reproducible Analytics Pipeline” whereby the dependencies are implicitly defined by the pipeline itself, and redundant code is not re-run. The names of the targets are the objects created, which are hashed and stored in _targets/objects as .rds files. Targets which have not changed since the last run will not be re-run. A truncated example of the critical targets pipeline is shown below.

Targets pipeline example
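For illustration, a heavily truncated _targets.R of this shape could look like the sketch below; the target names, helper functions and site IDs are examples of the pattern, not the actual pipeline.

```r
# _targets.R (illustrative sketch)
library(targets)
tar_source("functions.R")   # load the custom functions

list(
  tar_target(report_year, 2022),
  tar_target(site_ids, c(215, 452, 500)),                   # illustrative site IDs
  tar_target(site_meta, get_site_meta(connect_access())),   # illustrative helper functions
  tar_target(continuous_data, get_continuous_data(
    connect_envista(), site_ids,
    paste0(report_year, "-01-01"), paste0(report_year, "-12-31"), site_meta)),
  tar_target(asr_tables, make_table_list(table_a1 = make_site_meta_table(continuous_data))),
  tar_target(asr_workbook,
             writexl::write_xlsx(asr_tables, "asr_output_tables.xlsx"),
             format = "file")
)
```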

4.2.5 Project 1: Summary of Activities

In Project 1 I carried out the following activities.

  1. Defined the objectives for the project, namely
    1. Summary data tables for ASR spreadsheet in specified format
    2. Appropriate graphic output for ASR (S14)
    3. Updated spatial data for most recent year’s air quality data
    4. Summary data for CAZ reporting in specified spreadsheet format
  2. Developed code and functions using R to deliver objectives
  3. Adapted code and working methods to accommodate changes in requirements (incorrect exposure periods) (B7, B2)
  4. Adopted best practice regarding organisational data requirements to secure credentials (S9, B1)
  5. Investigated and implemented a reproducible analytics pipeline using targets (B5)
  6. Documented the code and analytics approach using literate programming (this Quarto document)
  7. Implemented version control using Git and GitHub to ensure reproducibility and tracking

4.2.6 Project 1: Learning and Reflection

The process of reporting annual air quality data has changed somewhat over the 25 years I have been doing it. There is a wide range of tasks needed and several different data sets to process. There is a great deal of potential for human error, and I have made mistakes in the past when processing this data. I thought it was a useful project to tackle as part of the L4 apprenticeship because greater rigour in the analysis process and a carefully designed automation process should minimise the errors.

Initially I designed the process to use the open data portal where we publish our air quality data, but as the project developed I became aware that the open data portal was to be decommissioned in 2023. I therefore revised the design to refer directly to the canonical databases to source data. (B6, B7)

I started developing the processing in one R script. This soon became unmanageable so I investigated options to help manage the process. I learned about Reproducible Analytics Pipelines and read the book: Building reproducible analytical pipelines with R. I adopted as many of the approaches recommended in the book as possible, and I feel that this has greatly improved the quality of the code and analysis used in the project. (B5, B6)

Air quality reporting is unfortunately somewhat constrained by the prescriptive and closed requirements of the government department (Defra) which governs the activity in local government. Because of this it is quite difficult for local authorities to share their data in a coherent and consistent way such that a single view of air quality across the UK is possible. If open data approaches were enabled for local authorities by Defra it would enhance understanding of air quality for the public and increase access for researchers.

There are undoubtedly improvements that I could make if I had more time. I would like to do more unit testing of the functions used and ideally build an R package to implement the functions I have developed. I would also like to improve the data management operations upstream of the reporting process; for example, the database used to store diffusion tube and monitoring site metadata is currently MS Access (2010). Moving this to the corporate data lake or a data warehouse would be more robust and would enable integration with other data sets in the council. (B6)

Even though I have designed the analysis to be reproducible and accessible, I acknowledge that running the analysis requires a working to intermediate level of expertise with R, and domain specific knowledge. This is potentially a problem in the organisation as I am one of the only people in the council with expertise in R. Hence there is a risk to business continuity with only one person able to run the analysis (safely).

The council is centralising data engineering services, but this function is likely to be focussed on generic business intelligence reporting, not specific “niche” domains such as air quality. Hence it is likely that air quality reporting will remain in a specific team. I am training two colleagues in using R and specialist packages for air quality reporting to help mitigate the risk to business continuity of having only a single practitioner.

4.2.7 Project 1: Business Benefits

Project 1 clearly benefits the council’s air quality team by standardising and automating the statutory reporting procedures for air quality monitoring. Rather than laboriously collating data by hand from disparate sources and summarising them in a tool such as Excel, the entire process can run reproducibly and efficiently as two R processes.

4.3 Project 2: Data Products to Share the Results of Citizen Science Monitoring in the Slow the Smoke Project

4.3.1 Introduction

The Slow the Smoke (StS) project is a Citizen Science project funded by Defra’s Air Quality Grant. It aims to test engagement approaches based on citizen monitoring of air quality with a particular focus on emissions from domestic solid fuel burning (wood burning stoves etc).

Bristol City Council leads the project and I am the project manager. We have two partners, the University of the West of England (UWE) and Knowle West Media Centre (KWMC). UWE lead on the technical aspects of air quality and on writing the final report. KWMC lead on the outreach activities. I developed, and agreed with the partners, a data sharing agreement to cover the management under GDPR of personal data collected through the outreach activities. This ensures personal data are safeguarded in accordance with GDPR. (K1, K2, K15, S1)

The combination of citizen science, community engagement and behavioural surveys is intended to identify effective approaches to influencing behaviour in relation to domestic emissions to air.

Ten citizen scientists have self-selected in the study area, which is a ward in the city centre called Ashley. This ward was selected because we have evidence of a higher than average level of solid fuel burning.

The aim of this project is to produce data products which support citizen scientists and citizens in understanding air quality in their area. This includes: learning materials for workshops, a sensor dashboard and advanced analytics to assist with policy decisions.

4.3.2 Data Sources

For the sensor dashboard and learning materials for citizen scientists, the main data source was the online data sets for the low-cost sensors. The sensors used are SDS011 sensors that measure PM10 and PM2.5, as well as temperature and humidity. The measurement interval is five minutes, and the data are transmitted via domestic wifi to a data repository which provides a REST API. This API was used to poll the data regularly and re-publish it on Bristol City Council’s open data portal.

The interactive polar plot maps also used data from the “reference method” (i.e. high quality) instruments operated by Bristol City Council and Defra. These were sourced from Bristol’s open data portal.

4.3.3 Automated Data Ingestion and Processing

Although the Sensor.Community organisation provides some mapping and interrogation tools for the data collected from the sensors, it was felt that Bristol’s open data platform offered several advantages for this project over the Sensor.Community offer as follows.

  1. A unified single point of enquiry for local air quality data
  2. Ability to integrate with other air quality measurements
  3. Interactive bespoke dashboard and visualisation options
  4. A powerful and well documented API for querying the data
  5. Integrated aggregation of data to hourly interval enabling direct comparison with reference method instruments

In order to satisfy the expectations of the customers for these data, i.e. the citizen scientists and members of the public, it was decided that the data should be as current as possible. (K7, K4, K9) This meant designing a real-time Extract, Transform, Load (ETL) process to query the raw local Sensor.Community data from their API, transform it into a state suitable for publishing on our portal, and then use the push API on our portal to publish the data.

The ETL process was first trialled using scripts in R. The httr library was used to develop functions to extract the data and push it to the portal, with data manipulation conducted using the tidyverse meta-package. After successfully prototyping the solution, it was moved to production using FME (Feature Manipulation Engine).
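In outline, that prototype followed a pattern like the sketch below; the Sensor.Community bounding-box endpoint, the portal push URL and the API key handling are placeholders rather than the production configuration.

```r
library(httr)

# Extract: query the Sensor.Community API for readings inside a bounding box
extract_sensor_data <- function(bbox = "51.40,-2.70,51.52,-2.50") {
  resp <- GET(paste0("https://data.sensor.community/airrohr/v1/filter/box=", bbox))
  stop_for_status(resp)
  content(resp, as = "parsed", simplifyVector = TRUE)
}

# Load: push the transformed records to the open data portal's push API
push_to_portal <- function(records, push_url, api_key) {
  POST(
    push_url,
    add_headers(Authorization = paste("Apikey", api_key)),
    body = records,
    encode = "json"
  )
}
```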

FME is a no-code platform for ETL which is primarily focused on spatial data. Bristol City Council also has FME Server, which supports scheduled FME processes. Two FME workspaces were developed to implement the ETL of the sensor data. These are represented visually in the screenshots below. (K10, K11, K12, S1)

Hourly FME ETL Process

Daily FME ETL Process

It was decided to split the ETL into hourly and daily processes because the latency in the Sensor.Community API made it impossible to reliably downsample data into hourly intervals in a real-time process. The hourly FME process subsets the measurement data using a bounding box to get just Bristol’s data, unnests the PM10 and PM2.5 values from the JSON array and writes the data to a csv file. (S9, K10, S10)

The daily FME process then downsamples the data in the csv to hourly intervals, transforms it in various ways, writes the data to a JSON array, pushes it to the open data portal and flushes the temporary csv file to prepare it for the next cycle. The daily process also automatically adds any new sites established by a citizen scientist to the open data datasets for the sensors.

The FME schedules run reliably and this has proved a robust method for automating ETL processes for the open data portal.

4.3.4 Developing the Data Products for Citizen Scientists

Making Sense of Data - Workshop and Learning Resources

It was recognised that citizen scientists would need some initial training and induction to familiarise themselves with the sensors and the data that was collected (S7, S12). A “Making Sense of Data” workshop was convened in July 2022 and interactive resources hosted on the open data platform were prepared to support the learning activity. (S7, S12)

The learning resources include:

  1. Interactive leaflet maps showing locations of the sensors
  2. Time series interactive charts (vega) showing hourly and daily sensor data
  3. Polar plots showing dominant wind speeds and directions for different levels of pollutants
  4. Animated gif images of polar plots showing how pollution varies with time and highlighting weather features that affect air pollution.
  5. A short video introducing the open data portal and explaining how to query, analyse, map and interpret the low cost sensor air quality data using the tools on the open data platform.

Making Sense of Data - screenshot 1

Making Sense of Data - screenshot 2

Making Sense of Data - screenshot 3

Making Sense of Data - video screenshot

The “Making Sense of Data” (MSOD) page was built on the open data platform using the templates and tools available on the platform. The development environment is a content editor which enables opendatasoft’s “widgets” to be embedded in HTML tags. “Contexts” are data sets on the portal that can be filtered, aggregated and processed using either JavaScript or the widgets provided on the platform. (S9)

The overall structure of the page was quite simple. It was designed to act as a presentation medium, but to be interactive in order to show the features of the platform. Some code is extracted and shown below to illustrate the composition of the page. This shows the KPIs at the top of the page, which summarise the data using mean (AVG) and count functions. (K13, S10, S14)

Making sense of data: aggregation code

Making sense of data: Weather chart, time selection

On the MSOD page, the concept of “polar plots” is introduced. These are two-dimensional plots that indicate the strength and direction of the wind that brings pollution to the sensor site. This is a specialised chart provided by openair::polarPlot() and requires a data frame with wind speed, wind direction, date time and pollution concentrations to render. The detailed operation of the function is beyond the scope of this document, but the output chart is useful to the non-expert in understanding the relationship between wind and pollution. In order to show the dynamic effect of wind on pollution, an animated GIF was developed and used in the MSOD page.

This analysis is descriptive in the sense that it provides valuable information as to the source of the pollution problem. In the making sense of data workshops, the polar plots are used to explain that the sources of pollution are largely regional, but that the plots do show some local sources of pollution which are amenable to local controls.

Firstly a polar plot was created for each day at a specific site. This results in around 90 images. Then a function using the magick library for image manipulation “animates” the multiple PNG files to create a single GIF. The GIF is simply published as an asset on the portal and called using the <img> tag. A sample of the R code to create the animated GIF is shown below.

Animated GIF code
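In outline, the approach is roughly as follows; the input data frame, directory, site and plotting parameters here are illustrative rather than the code shown in the screenshot.

```r
library(openair)
library(magick)
library(purrr)

# One polar plot PNG per day for the chosen site...
dir.create("polar_png", showWarnings = FALSE)
days <- unique(as.Date(sensor_met_data$date))

walk(days, function(d) {
  png(file.path("polar_png", paste0(d, ".png")), width = 600, height = 600)
  polarPlot(subset(sensor_met_data, as.Date(date) == d),
            pollutant = "pm2.5", main = format(d, "%d %b %Y"))
  dev.off()
})

# ...then stitch the PNGs into a single animated GIF
list.files("polar_png", pattern = "\\.png$", full.names = TRUE) |>
  image_read() |>
  image_animate(fps = 4) |>
  image_write("polar_animation.gif")
```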

Sensor Dashboard

In addition to the “Making Sense of Data” workshop and learning resources, it was important to provide an ongoing resource for the citizen scientists to interrogate their data as non-experts in environmental science. This was strongly endorsed by participants in the workshops. (S7)

The open data portal provides strong capability for visualisation and transformation by default. For each dataset there is a tabular view, which can be ordered and filtered by facet or user entered search criteria. Interactive maps and charts are also provided which can be similarly filtered and the filtered data can be exported in a range of common formats.

For experienced users, a REST API is available which offers some SQL-like syntax to manipulate and export the data. However, for this purpose it was felt appropriate to develop a more curated data product than the default offer, tailored to the anticipated needs of the customer. The dashboard should help to answer the following questions.

  1. What is the current (or recent) level of pollution at my sensor?
  2. What is the spatial context of my sensor?
  3. What do the readings mean in terms of relevant legal or health - based limits?
  4. How do pollutant levels change with time of day or day of week?
  5. How can I relate pollution levels to wind patterns or weather?
  6. How can I download the data if I need to do more detailed analysis on it?

(S7, S12, S13, S14, S15, B2)

A web dashboard was designed to answer these questions by implementing the following features. (S14)

  1. An interactive map on the first page where the user can select their sensor and go to the dashboard main page
  2. A simple coloured bar gauge graphic at the top of the page for each pollutant, relating the most recent concentration to the relevant air quality index and including a link to the method for determining the index
  3. Descriptive text for the sensor and an interactive zoomable map showing the location of the selected sensor
  4. A slider control to select the time period for subsequent visualisations
  5. Tabs providing five different visualisations:
    1. Time series chart for both pollutants (selectable)
    2. Daily mean chart including relevant World Health Organisation (WHO) guideline values for comparison
    3. Charts for hour of day and day of week, to show diurnal and weekly variation
    4. A wind rose for the selected time period, which shows wind speed and direction
    5. A data table with download links in Excel, JSON and CSV formats

Sensor Dashboard 1 Sensor Dashboard 2

The sensor dashboard was also built on opendatasoft’s platform. Some sample code is shown below. Firstly, the gauges that compare the most recent readings from the sensors with the air quality index for the UK: these use the ods-gauge widget and colour the gauge bar using CSS (not shown).

Sensor dashboard code: gauge

The code below uses some simple JavaScript string manipulation to extract latitude and longitude from the data set.

Sensor dashboard code: location (lat, lon)

The code below is typical use of the ods-chart widget which implements a vega interactive chart to display the data.

Sensor dashboard code: chart

The next code fragment returns a wind rose image from the Iowa State Mesonet by dynamically sending the string assembled in the JavaScript code as a request to the REST API (S14). The date period is determined by the date time selector widget. The weather station is hard-coded as the closest weather station with publicly available weather data for Bristol: Lulsgate airport.

Sensor dashboard code: wind rose

Technologies used in Data Products

Both the “Making Sense of Data” learning resources and the sensor dashboard were built on the Opendatasoft platform using the components available. These were:

  1. Web development using HTML and CSS
  2. AngularJS for logic
  3. Opendatasoft’s widgets for visualisations, aggregation, iteration and controls
  4. JavaScript for string and data manipulation

(K10, S2, S7, S12, S14)

In addition to these technologies, JavaScript was used to manipulate a URL in order to implement the wind rose. The wind rose image is rendered by sending a request to a resource at https://mesonet.agron.iastate.edu. Depending on the query within the request, a wind rose is returned for the requested time period.

4.3.5 Advanced Data Products for Project Reporting

The two data products cited above are helpful for citizen scientists trying to understand their sensor data. The Slow the Smoke project is also intended to help other practitioners in the air quality field to understand the issue of air pollution arising from the specific source of domestic solid fuel burning, for example open fires and solid fuel stoves. This is a serious and growing problem for public health.

In order to provide deeper insight into this issue, further analysis was conducted on the data from the Slow the Smoke, Sensor.Community and reference method sensors in Bristol.

Openair Polar Plot Maps

The openair package in R provides open source tools for the analysis of air quality data. Several bespoke visualisations are available to reveal patterns in air quality data such as summary plots, heat maps and polar plots, as used in the “Making Sense of Data” materials. A separate package openairmaps implements polar plots superimposed on an interactive Leaflet base map.

The openairmaps package has been designed to create interactive HTML air quality maps. These are useful to understand the geospatial context of openair-type analysis, and to present data in an engaging way in dashboards and websites.

The advantages of visualising air quality data in this way are as follows:

  1. Where many sensors are in the same location, the dispersion patterns can be inspected for consistency
  2. Data can be binned by, for example, time period and season to help reveal likely source sectors
  3. If reference method and low cost sensors are used on both maps, the relative difference in scale of measurements can be highlighted

Despite these advantages, care must be taken not to misinterpret the maps. They are intended to indicate patterns of dispersion and source direction, not absolute values or comparison with objectives. This is why they are not deployed as a resource in the sensor dashboard. (S14)

The data were prepared by first filtering the available Sensor.Community sensor data by data capture rate. This is necessary because the surface fitting algorithm in openairmaps requires a certain level of data availability. Reference method data are downloaded from open data resources and joined to the low-cost sensor data, and then wind speed and direction data from Bristol Airport are joined to the air quality data by timestamp. The data are binned using the cutData() function in openair to create a field which is used in a selector control. This dataset is fed into the polarMap() function for each pollutant to produce the map.

Prepare data for polar plot map

Function to plot polar map
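A sketch of both steps is shown below, assuming the joined data frame already holds date, ws, wd, lat and lon columns alongside the pollutant; the argument values are illustrative.

```r
library(openair)
library(openairmaps)
library(dplyr)

# Bin the joined data by season; cutData() adds a "season" column from the date
prepare_polar_data <- function(aq_met_tbl) {
  aq_met_tbl |>
    filter(!is.na(ws), !is.na(wd)) |>
    cutData(type = "season")
}

# Render the interactive Leaflet map with one polar plot per monitor,
# with a layer control to switch between the seasonal bins
make_polar_map <- function(polar_data, pollutant = "pm2.5") {
  polarMap(
    polar_data,
    pollutant = pollutant,
    latitude  = "lat",
    longitude = "lon",
    control   = "season"
  )
}
```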

The interactive maps are published by rendering them as Quarto documents and publishing them to Quarto Pub.

An example of a static version of the maps is shown below. (S14)

Example of a static PM2.5 polar map, cut by season

4.3.6 Project 2: Summary of Activities

In Project 2 I carried out the following activities

  1. Developed, tested and deployed to production an automated ETL process for real - time sensor data
  2. Developed and delivered learning material for citizen scientists for the Slow the Smoke project including:
    1. An interactive open data based web page integrating various data sets
    2. A short video showing how to use the open data portal
    3. Advanced domain - specific visualisations (polar plots) to explain air pollution
  3. Developed and published a bespoke sensor dashboard including:
    1. Maps, charts and user - selectable time periods for subsetting data and visualisations
    2. A wind rose image using a REST API
    3. Downloadable sensor data in a range of formats
  4. Developed and published a Polar Maps interactive data product to explain the temporal variation in pollution with a compelling visualisation which also gives insight as to sources and potential mitigation
  5. The data products listed above are included in the final report to the project funders (Defra)

4.3.7 Project 2: Learning and Reflection

This was an interesting project that included a range of technologies and approaches to deliver successful products. I had some experience with the development of dashboards and visualisations on the opendatasoft platform, but this project extended that, and I learned some new techniques, including how to use a third-party REST API URL on the platform. However, I would say that my skills with front-end technologies like JavaScript and AngularJS are still quite limited. The web resources I built could have been improved if I had had more time to adjust the CSS and HTML in the pages; however, I think the quality was satisfactory for the audience.

I enjoyed the process of developing the ETL pipeline in FME and I think that worked quite well.

The MSOD learning material was well received when I presented it to citizen scientists in July 2022, and the polar plots used on the page are an eye-catching visual device that definitely helps to convey what is happening with air pollution under differing weather conditions (S14). It would not have been possible to create the polar plots using any software other than R. Knowledge of the openair package was very helpful in this instance. (S5)

The sensor dashboard was interesting to build and a good example of the powerful and flexible options available on opendatasoft’s platform.

4.3.8 Project 2: Business Benefits

The data products developed were key deliverables for the Slow the Smoke project. The extract below is from the grant application for the Slow the Smoke project.

Data from all of these air quality monitors will be integrated into a location-specific dashboard so that residents have a single view of air quality in their area. The dashboard will deliver compelling mapping and visualisation of trends, statistics and Key Performance Indicators (KPI) and will be curated by air quality officers in Bristol City Council. The dashboard will be developed with the community and tested to ensure it meets community needs.

The dashboard will be a key point of reference in the project for factual data on air quality with its structure and content being influenced by input from the KWMC citizen engagement process. The limitations of the Luftdaten devices in terms of accuracy will be clearly explained.

Public engagement was a key strand in this work and the MSOD page and sensor dashboard were helpful in explaining data to the public. The data products developed in the project demonstrated to funders that BCC is a reliable project partner and satisfied the conditions of the contract. No additional costs were incurred for the project as all the data products were built using existing infrastructure or free and open source software.

4.4 Project 3: Comparing Performance of Low-Cost Sensors with Reference Method Instruments

4.4.1 Introduction

The Slow the Smoke Project is introduced in Project 2. It includes a significant “citizen science” component.

The citizen scientists have each been given a “low-cost” air sensor that monitors particulate matter (PM). PM is fine dust in the air, including smoke. There are two fractions of PM that are important for health: PM10 (aerodynamic diameter < 10 μm) and PM2.5 (aerodynamic diameter < 2.5 μm). The devices deployed monitor both of these fractions using a light-scattering technique, where the scattering of laser light is a function of the concentration of PM in the sampled air.

Because these devices do not directly measure the concentration of PM, but use a proxy measure, they are not as accurate as the “reference” measuring instruments used to assess compliance with air quality objectives. It is therefore necessary to attempt to characterise the performance of these devices in relation to approved or “reference method” devices by co-locating the low-cost sensors with reference method instruments and comparing measurements. This was a stated aim in the grant application for the Slow the Smoke project.

In addition to the community sensors, we will install a reference/equivalence method instrument in the study area to validate measurements through co-location with one or more Luftdaten devices.

The installation of the reference method equipment is ongoing, but co-location was carried out with existing reference method sites.

The characterisation of the performance of the low-cost sensors was extended through predictive analytics, to develop a model which can be used to predict, from the low-cost sensor signals, concentrations of particulate matter to an accuracy approaching that achieved by reference method instruments.

In the field of air quality, predictive and prescriptive modelling is usually carried out using dispersion models, which are based on a Gaussian distribution of pollution within a plume. They also often include algorithms for calculating the effects of atmospheric chemistry for pollutants such as ozone and oxides of nitrogen. Source terms in the model are meteorological data and activity data such as industrial processes or pollutant emissions from vehicle movements. I have used these models in the past to predict concentrations of regulated pollutants across the city, and to test interventions such as a low emissions zone. (K14, S11)

The modelling in this project was not intended to predict concentrations from source terms such as activity and meteorology, but to predict more accurate concentrations of particulate matter than available from an untreated low - cost sensor.

4.4.2 Co-location Study

There are three monitoring sites which measure PM in Bristol. A map and summary data for these sites are shown below.

Map of monitoring sites used in study

Co-location: Potential Sites

| Site | Description | Pollutants |
|------|-------------|------------|
| AURN St Pauls - 452 | The monitoring station is located within a self-contained, air-conditioned unit in the car park of a day nursery, approximately 30 metres south-east of Wilder Street, a lightly trafficked urban back street. The surrounding area is primarily residential with some commercial premises in the immediate vicinity. | PM2.5, PM10 |
| Parson Street School - 215 | The monitoring site has been operating since 2002 and was selected as a roadside site that represents residential exposure to relatively high concentrations of traffic-generated pollution. The sample inlet is approximately 1 metre from ground level and four metres from the kerb of Bedminster Road, where traffic queues for the traffic lights. | PM2.5 |
| Temple Way - 500 | Roadside monitoring station beside a multi-lane section of the A4044 (Temple Way), a major route in and out of Bristol city centre. | PM10 |

The co-location study used two of these sites: Parson Street School (site 215) and Temple Way (site 500), both of which are operated by the council. Using the AURN St Pauls site would have required permission from the Environment Agency, which would have been a time-consuming and uncertain process.

The airrohr SDS011 fine dust sensors require a wifi signal in order to push data to a server, but the Bristol City Council sites did not previously have wifi access: in 2021 the telemetry at our air monitoring sites was a combination of 3G modems and analogue land lines.

In order to accommodate the co-location study, and also for the purposes of virtualising our data collection machine, I procured, configured and installed Teltonika RUT950 or RUT955 4G LTE routers at all of our monitoring sites. This required me to learn about network protocols, network security and the configuration of Windows updates. (B7)

I liaised with our network team to configure white lists for the Envista ARM application to access the routers. This enabled 4G TCP/IP access to the data loggers or instruments at all of the sites, and also provided a wifi hotspot to enable the SDS011 sensors to send data. (B7)

The physical installation of the SDS011 sensors at both sites was complete in early May 2022.

Colocation Study: Concept and Objective

The aim of the co-location study is to compare the performance of the low-cost sensors with that of reference method instruments measuring the same pollutant. The comparison was implemented by collecting hourly data for the two co-located devices and establishing the linearity of the response, using a model to report coefficients and R².

In addition to simply describing the relationship between the low-cost and reference devices, the model was used to predict reference method values (i.e. “real” concentrations) from low-cost sensor readings.
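
As an illustration of this idea, a minimal sketch in R is shown below; the colocated data frame and its low_cost and reference columns are assumed names, not the project’s actual objects.

```r
# Hypothetical hourly co-located data: one row per hour with the low-cost
# and reference readings side by side
colocated <- readRDS("colocated_hourly.rds")

# Simple linear model of the reference signal on the low-cost signal
fit <- lm(reference ~ low_cost, data = colocated)

# Slope, intercept and R-squared describe the linearity of the response
summary(fit)$coefficients
summary(fit)$r.squared

# Predict "real" (reference-equivalent) concentrations from low-cost readings
predict(fit, newdata = data.frame(low_cost = c(5, 12, 30)))
```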

Some of the methods for testing the performance of the sensors have been adopted from the United States Environmental Protection Agency (USEPA) report on the Performance Testing Protocols, Metrics and Target Values for Fine Particulate Matter Air Sensors.

Within this study it was not possible to compare the responses of multiple low cost sensors with each other as there was not a budget to purchase additional devices for this purpose.

Reference Method Equipment

Continuous Ambient Air Quality Monitoring Systems (CAMS) are certified by the Environment Agency under their MCERTS scheme. This certifies the quality regimes and equipment for environmental permit holders. Local authorities are required to use MCERTS (or equivalent) equipment for monitoring air quality under the LAQM (Local Air Quality Management) regime. The certification and approval process ensures that the measurement standard provides data of an acceptable quality.

In addition to the specification of the equipment, a local authority is required to adhere to the calibration and maintenance requirements for the equipment as set out in the relevant guidance LAQM.TG(16) and the Local Site Operator (LSO) manual for sites that are part of, or affiliated to the national AURN (Automated Urban and Rural Network).

(K2, K8, K9, S1)

The equipment used in this colocation study is the Met One BAM 1020 Continuous Particulate Monitor, hereafter referred to as “BAM 1020”. The instrument works by drawing a sample of air through a filter tape every hour. The deposited PM is then exposed to a radioactive carbon-14 source on one side of the filter tape, and a beta radiation detector on the other side of the tape measures the attenuation of the beta radiation through the sampled filter. The attenuation is a function of the PM mass deposited on the filter tape, and because the flow rate of the sampled air is known, the concentration in μg/m³ can be calculated. Hourly concentrations are recorded on an external data logger and polled by a central telemetry system which holds the data on a SQL Server corporate database. (K6)

Colocation Sites: Temple Way

The Temple Way site is affiliated to the national monitoring network. PM10 and NOx are measured at this site; summary metadata are provided below.

Colocation Sites: Parson Street School

Oxides of Nitrogen (NOx) have been measured at Parson Street School for many years. The enclosure is close to the roadside of a busy, queuing road and represents the exposure of school children and school staff. In recognition of the need to understand exposure to PM2.5 at a roadside site, the monitoring station was updated with a BAM 1020 in 2021. The BAM 1020, when configured for monitoring PM2.5, includes an additional particle size cut-off filter and incorporates a heated inlet to drive off volatile compounds from the sampled air. The summary metadata for the site are shown below.

Colocation Configuration

The low-cost sensors were co-located inside the cages of the monitoring sites: the Parson Street sensor was installed on 29th March 2022 and the Temple Way sensor on 1st May 2022. The photographs below show the detail of the co-located devices at each monitoring site.

Temple Way

Parson Street

Data Sources

Data from one low-cost sensor and from both reference instruments are published in near real time on the council’s open data portal. The BAM 1020 data are available through the air-quality-data-continuous dataset and the Parson Street low-cost sensor data are available through the luftdaten_pm_bristol dataset.

The latter dataset is a geographical subset of the sensor.community archive focussed on Bristol, with data aggregated to give an hourly mean value for both PM10 and PM2.5. Custom functions were written in R to access these sources and import the data as data frames. The data for the low-cost sensor at Temple Way are not published on the council’s open data portal because that sensor did not register properly on the sensor.community portal; the data are, however, available through a combination of online .csv and .zip files. (K10, B7)
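
For illustration, a CSV export from the open data portal can be read directly into R. The export URL and column names below are assumptions for the purpose of the sketch and may differ from the custom functions actually used in the pipeline.

```r
library(readr)
library(dplyr)

# Assumed CSV export URL for the continuous air quality dataset; depending on
# the portal configuration the export may be semicolon-delimited, in which
# case read_csv2() would be needed instead of read_csv()
ods_url <- paste0(
  "https://opendata.bristol.gov.uk/explore/dataset/",
  "air-quality-data-continuous/download/?format=csv"
)

bam_raw <- read_csv(ods_url, show_col_types = FALSE)

# Keep only the co-location sites and the PM columns (column names assumed)
bam_pm <- bam_raw %>%
  filter(siteid %in% c(215, 500)) %>%
  select(date_time, siteid, pm10, pm25)
```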

Data Processing Pipeline

Initially the data processing pipeline was split into three R scripts: one for retrieving and preparing the data, one for plotting, and one for modelling. While developing this approach it became apparent that there were significant problems with it: tracking operations across three different scripts became confusing, and testing and debugging were sub-optimal, resulting in errors if the scripts were not run in order. Researching this issue identified a possible solution, which was eventually used for this project (B5, B7). The targets R package is a “Make”-like pipeline tool for data science in R. It helps maintain a reproducible workflow, minimises repetition by running only the code that needs to run, and encourages a functional approach to code and pipeline development.

(B7, B6, S15, K11)
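
A minimal sketch of the kind of _targets.R file used for this sort of pipeline is shown below; the target and function names are illustrative rather than those actually used in the project.

```r
# _targets.R -- illustrative skeleton of a targets pipeline
library(targets)

# Functions such as get_colocated_data(), make_model_tbl() and plot_drift()
# would be defined in R/functions.R; these names are assumptions
tar_source("R/functions.R")

tar_option_set(packages = c("dplyr", "tidyr", "purrr", "ggplot2", "broom"))

list(
  tar_target(raw_data, get_colocated_data()),      # retrieve and combine source data
  tar_target(model_tbl, make_model_tbl(raw_data)),  # nested tibble of per-site model data
  tar_target(drift_plots, plot_drift(model_tbl))    # diagnostic plots
)
```

Running targets::tar_make() then rebuilds only the targets whose upstream code or data have changed.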

One advantage of using targets is that the entire processing pipeline can be viewed as an interactive graph using the tar_visnetwork() function. This gives a quick overview of functions, targets and their dependencies and helps understanding of the process. An extract from the pipeline visualisation is shown below, highlighting the model development part of the pipeline.

Visual network graph of data processing pipeline

Retrieving and Preparing Data for Modelling

Data for the two BAM 1020 instruments and the Parson Street low cost sensor are retrieved using a purpose built R function which uses endpoints from the Opendatasoft API. This is sourced from a script outside the project and is shown below.

Import ODS function

For the Temple Way sensor a different approach was developed to retrieve the data, which is available through the Madavi API as a combination of csv and zip files. This operation entails retrieving a number of files in different formats and joining the data into one data frame. The function which retrieves the monthly .zip file urls is shown below.

Zip file download function

Once data are retrieved from the online repositories, they are combined into a nested data frame or “tibble” with functions from the purrr package such that columns are created that can be used in the subsequent modelling process. The resulting nested tibble is grouped by the site ID with list columns containing the raw hourly data, the prepared source data for the model, and a pivoted (wide) data set used in the plotting and modelling functions.

Nesting the model data enables a concise, iterative approach to developing the modelling pipeline: the modelling process itself generates a data frame with list columns containing the regression model objects, the model output, and plots which explain the data.
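
A hedged sketch of this nesting pattern is shown below, assuming a daily_data frame with columns site_id, date, low_cost, reference and humidity; the names are illustrative only.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# daily_data: assumed daily means, one row per site and day
model_tbl <- daily_data %>%
  group_by(site_id) %>%
  nest() %>%                                    # list column "data", one row per site
  mutate(
    fit   = map(data, ~ lm(reference ~ low_cost + humidity, data = .x)),
    coefs = map(fit, tidy),                     # per-site coefficient tables
    stats = map(fit, glance)                    # per-site R-squared, p-value, etc.
  )
```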

Although data from the reference method instruments are collected at hourly frequency, and data from the low-cost sensors are collected at five-minute frequency (then downsampled to hourly), it was decided that the modelling exercise would use daily data. This reduces the large variation in the hourly data and reflects the legal objectives for these pollutants, which are largely set in relation to daily rather than hourly periods.
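
The daily aggregation itself could be done along the following lines; the hourly_data frame and its column names are assumptions.

```r
library(dplyr)
library(lubridate)

# hourly_data: assumed columns date_time (POSIXct), site_id, low_cost,
# reference and humidity
daily_data <- hourly_data %>%
  mutate(date = as_date(date_time)) %>%
  group_by(site_id, date) %>%
  summarise(
    across(c(low_cost, reference, humidity), ~ mean(.x, na.rm = TRUE)),
    .groups = "drop"
  )
```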

Plotting

Functions were developed to create a range of visualisations of both the raw and processed data and model outputs. Not all function code is shown here, but some examples are presented.

  1. plot.drift.site.gg() is a function based on the ggplot2 library which takes the prepared model data tibble and a site ID and returns a ggplot object (plot) showing how the pollutant concentrations from the low cost sensor diverge from the reference method measurements over time. The plot uses the geom_smooth() geom with the smoothing set to “gam” in order to show a smoothed trace.

Drift plot function

  2. plot.scatter.site.gg() is another ggplot-based function which generates a scatter plot for the co-located instruments. It incorporates functions from the ggside library (an extension for ggplot2) to add side-plots showing the distribution of data points on each axis, and includes stat_cor() and stat_regline_equation() from the ggpubr library to show correlation statistics and the equation derived from the linear model on the chart. This function therefore provides much useful information about the relationship between the two data sets on a single chart; a hedged sketch of this style of plot is given after the list.

Scatter plot function

  3. prep_timeplot() is a helper function that takes the prepared data for modelling and processes it for use in another function. The function subsets the data, renames columns, pivots longer and pads the time series so that the data are ready for the plot_time_series() function.
  4. plot_time_series() takes the data from step 3 above and a string representing an interval (e.g. “hour”) and plots a time series chart using the ggplot2 library.
  5. Other helper functions format and save plots.
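
As referenced in item 2, a minimal sketch of this style of co-location scatter plot is shown below. It is illustrative only: the wide_data frame and its low_cost / reference columns are assumed names, not the project’s actual objects or its plot.scatter.site.gg() function.

```r
library(ggplot2)
library(ggside)   # side panels showing marginal distributions
library(ggpubr)   # stat_cor() and stat_regline_equation()

# Illustrative co-location scatter plot; wide_data, low_cost and reference
# are assumed names for a wide data frame of paired measurements
scatter_colocation <- function(wide_data) {
  ggplot(wide_data, aes(x = low_cost, y = reference)) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "lm", se = FALSE) +
    geom_xsidedensity() +                       # distribution of low-cost values
    geom_ysidedensity() +                       # distribution of reference values
    stat_cor() +                                # correlation coefficient and p-value
    stat_regline_equation(label.y.npc = 0.9) +  # fitted linear equation
    labs(x = "Low-cost sensor (µg/m³)", y = "Reference method (µg/m³)")
}
```
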
Exploratory Data Analysis

The colocated data were first plotted using the summaryPlot() function from the openair package. This gives a schematic overview of the time series, distribution and summary statistics for the data. It is immediately apparent that the distributions differ: the low-cost sensor data are markedly more right-skewed, but with lower overall concentrations, than the reference data (K13, S10). It is also notable that significant quantities of data are missing from the reference PM10 instrument at Temple Way, because data were removed by the Environment Agency due to data quality issues. This is unfortunately likely to affect the performance of subsequent modelling (K8, S6).
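
For reference, the openair call takes the following form; openair expects a POSIXct date column, and joined is an assumed name for the combined hourly data frame.

```r
library(openair)

# joined: assumed hourly data frame with a `date` column and numeric columns
# for the low-cost and reference PM measurements
summaryPlot(joined, period = "months")
```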

Summary plot: PM10

Summary plot: PM2.5

Further exploratory data analysis (EDA) with time series charts showed that the data from the low cost sensors diverged significantly from the reference instruments. The plot below shows daily mean concentrations at both sites. (K8, S6)

Daily mean time series chart

The divergence does not appear to be uniform or consistent and seems to drift over the study period, so drift plots were developed to examine this artefact. The drift is significant, and differs between sites and between pollutants.

Drift plot: PM10

Drift plot: PM2.5

Scatter Plot and Linear Regression

Scatter plots were made of the co - located data at each site to visualise the hourly data and distributions alongside each other. Model coefficients are displayed on the charts.

Scatter plot: PM10

Scatter plot: PM2.5

Modelling

The relationship between the low-cost and reference sensors should ideally be linear, so a linear regression model is a reasonable choice for this task. However, other models may also be useful, so in addition to the linear model an xgboost model was fitted to compare with the simple linear model.

The USEPA report indicates that the low-cost sensors may be affected by ambient humidity levels, and this is borne out by research by Liu et al (2018). Hence it was decided to include humidity as a term in the model selection process, along with temperature as another available variable that may affect the relationship.

In order to compare the two models (xgboost and linear regression) across combinations of three regressors (low-cost readings, temperature and humidity), a function was written to carry out the following steps (a sketch of this comparison is given after the list):

  1. Split the data into training and test sets using the rsample::initial_time_split() function.
  2. Run the selected combinations of models and regressors on the training split and store the fitted models in a list.
  3. Using the metric functions in the yardstick package, build a table of model metrics to judge the performance of each model/regressor combination when applied to the unseen testing split.
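
A hedged sketch of this comparison for a single site and one regressor combination is shown below, using the packages named above. The site_daily data frame and its column names are assumptions rather than the project’s actual objects, and only the low-cost + humidity combination is shown.

```r
library(rsample)
library(parsnip)
library(yardstick)
library(dplyr)

# site_daily: assumed daily data for one site, ordered by date, with columns
# reference, low_cost, humidity and temperature
split <- initial_time_split(site_daily, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Candidate models: ordinary linear regression and xgboost
lm_spec  <- linear_reg() %>% set_engine("lm")
xgb_spec <- boost_tree() %>% set_engine("xgboost") %>% set_mode("regression")

# Fit both models with low-cost + humidity as regressors
lm_fit  <- fit(lm_spec,  reference ~ low_cost + humidity, data = train)
xgb_fit <- fit(xgb_spec, reference ~ low_cost + humidity, data = train)

# Judge performance on the unseen test split with yardstick metrics
eval_metrics <- metric_set(rmse, rsq, mae)

bind_rows(
  lm  = test %>% mutate(.pred = predict(lm_fit,  test)$.pred) %>%
    eval_metrics(truth = reference, estimate = .pred),
  xgb = test %>% mutate(.pred = predict(xgb_fit, test)$.pred) %>%
    eval_metrics(truth = reference, estimate = .pred),
  .id = "model"
)
```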

The results of this analysis are shown below.

Model selection table

It can be seen that for the PM2.5 instruments, the model and regressors that minimised RMSE and maximised R² were the linear model (lm) with the low-cost data and humidity as predictors; including temperature reduced MAE to a minimum. For PM10, the combination of the linear model with low-cost + humidity reduced all error metrics to minima and maximised R² when compared to all other options. It was therefore decided that the optimal modelling approach was a linear model with the low-cost readings and humidity as predictors for the reference signal. It may have been possible to improve the performance of the xgboost model further by parameter tuning, but for the purposes of this exercise the linear model seemed satisfactory.

The model data tibble previously created is used as input to a function which runs the linear model on the data for each site. A script defining custom themes for the plots is sourced at the top of the _targets.R file. The make.selected.model.output.tbl() function takes the data tibble and creates a new output tibble with the following features:

  1. Model input data (time stamped daily concentrations and humidity data for entire period)
  2. A temporal split object for training and testing the model
  3. Model objects for the training and full data
  4. Tidied datasets (estimates, standard error and p.values) for training and full data
  5. Glanced datasets (r squared, p.value and number of observations) from training and full data
  6. Augmented data - combined observed test data and predicted (fitted) data from the trained model
  7. A check_model object, visually summarising the model’s performance for training and full data
  8. Prediction plots showing the predicted daily values from the model for the test split using the augmented data
  9. A performance summary table for the training and full data

Modelling function

4.4.3 Results

Model Check plots

The easystats metapackage in R contains a package called performance, which includes the check_model() function. This provides a comprehensive set of plots in a single output which enables inspection of the model’s performance. The model check function was run on linear models fitted to both the training data and the full dataset. Results are shown below for the full model run for each site.
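
The call itself is a single line; model_fit here stands for one of the fitted lm objects rather than an object from the project.

```r
library(performance)
library(see)   # needed to draw the diagnostic panels

# Panel of diagnostic plots: linearity, homogeneity of variance,
# collinearity, influential observations and normality of residuals
check_model(model_fit)
```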

Model check plots: Full data set, site 500 (PM10)

Model check plots: Full data set, site 215 (PM2.5)

Inspecting the plots, it can be seen that the models perform reasonably well according to the analyses in the check_model() function. The linearity of residuals is good until very high concentrations are reached, and this also holds for the homogeneity of variance. Collinearity of predictors is low for both models. Normality of residuals is good at both sites, although at site 500 there is an interesting artefact around the first positive standard deviation, where residuals start to diverge slightly from normality. Overall, the linear model appears to perform well at both sites.

(K14, S11)

Summary Tables

The performance metrics for each model are extracted and tabulated into a range of formats for display. The gtsummary package produces a nicely formatted HTML table, and the performance package from the easystats suite of statistical tools is used to generate summary tables.
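
For example, for a fitted lm object (again the hypothetical model_fit):

```r
library(gtsummary)
library(performance)

# Formatted HTML table of coefficients, confidence intervals and p-values
tbl_regression(model_fit)

# One-row summary of fit metrics (R², adjusted R², RMSE, AIC, etc.)
model_performance(model_fit)
```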

Model Coefficients

The model coefficients are shown in the tables below for the models fitted to the training data and the full datasets. Note that the coefficients relate to both predictors (low-cost and humidity) and are therefore different from the coefficients shown on the exploratory scatter plots, which include only the low-cost term.

Model Coefficients (Training)

Model Coefficients (Full)

Model Performance

The key metrics for the models are shown in the table below. (K14, S11)

Model Performance Summary Table (Training)

Model Performance Summary Table (Full)

Model Predictions

The models can be used to predict data in the test split. The fitted data are compared with the observed (reference) data in the test set using a time series plot of daily mean concentrations. The test set is isolated from the training data, so this is a true test of the model on “unseen” data.
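
A sketch of how these test-split predictions can be generated and plotted is shown below; trained_lm and test_data are assumed names for the fitted lm object and the unseen test split.

```r
library(broom)
library(ggplot2)

# Augment the unseen test data with fitted values from the trained model
preds <- augment(trained_lm, newdata = test_data)

# Compare observed reference concentrations with the model predictions over time
ggplot(preds, aes(x = date)) +
  geom_line(aes(y = reference, colour = "observed")) +
  geom_line(aes(y = .fitted,  colour = ".fitted")) +
  labs(y = "Daily mean concentration (µg/m³)", colour = NULL)
```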

Model predictions on test data: PM10

Model predictions on test data: PM2.5

It can be seen that the model predictions (.fitted) are quite close to the observed reference data at both sites. However, the modelled concentrations are not of acceptable accuracy for a regulatory application such as air quality management. (K14, S11)

4.4.4 Project 3: Summary of Activities

The activities I undertook in Project 3 were as follows.

  1. Planned an environmental monitoring exercise to evaluate data quality from low - cost sensors as part of a funded project
  2. Installed sensors and configured telemetry at two sites, liaising with relevant stakeholders to secure permissions
  3. Developed a reproducible analytical pipeline (RAP) in R using the targets package to implement the following processes
    1. Get and combine data from a range of sources using R functions
    2. Conduct EDA to understand data in the context of the domain
    3. Use visualisations to understand the relationships within the data
    4. Test a range of predictive analytics approaches and select an appropriate model (K14, S11)
    5. Fit and evaluate the model on training/testing data sets
    6. Finalise the model and visualise predictions
  4. Critically evaluated the approach and suggested improvements

4.4.5 Project 3: Learning and Reflection

This project provided me with my first “real-world” exposure to predictive analytics using regression and machine learning models. The growth of low-cost instruments for sensing air quality in recent years means that there is a significant need to understand their performance compared with high quality instruments. This project has enabled me to gain a strong understanding of this and to speak with confidence on the issue when advising stakeholders.

The logistical challenges associated with implementing the monitoring campaign were significant. Data capture was an issue, both for the low-cost sensors and the reference instruments. The quality of the study could have been significantly improved if the monitoring had been done for at least a calendar year, as the seasonal effect could then have been modelled. In addition, a greater range of regression models could have been tested to potentially improve the performance of the models. (K14, S11)

If the project were extended I would like to develop the modelling approach into a web application to provide citizen scientists with a tool to adjust data from their sensors to show a “real” (reference) concentration. This would extend the work we have done in our team to explain the performance of low cost sensors and guide potential users of this technology.

In this project I used a combination of the tidymodels and easystats metapackages in R. I feel that I did not fully explore or understand the optimal workflows for modelling in these packages and there is certainly potential for improving my understanding of predictive analytics processes. (S13)

I again used targets to implement the RAP, and I think this was helpful in iterating the project. I have developed a workflow and conventions that seem to work well and should enable others to revisit the work and understand the code.

I could have used Python for this project, but I feel that I was able to make progress more quickly with R as I am familiar with it.

4.4.6 Project 3: Business Benefits

The co - location element of the Slow the Smoke project was a requirement of the grant funding and the predictive analysis of the data summarised in this project has been included in the final report to the funding body (Defra). In addition to discharging our requirements for the project, the work has reinforced understanding of the performance of low - cost sensors measuring particulate matter. As a team, we are often approached by manufacturers and suppliers to use or endorse these products, so a clear understanding of performance helps to evaluate these approaches.

4.4.7 Conclusions

This exercise used a simple linear model to establish the relationship between a reference method instrument and a low-cost sensor for two different measures of particulate pollution (PM10 and PM2.5) at two different sites in Bristol. Humidity was identified as an important term to include in the model and helped improve model performance. Coefficients for the models differed depending on the pollutant and site. Nevertheless, the models performed reasonably well, given the limited availability of data. (K14, S11)

The use of low-cost sensors in air quality measurement is growing, and large research projects have been established to assess the performance of the sensors and identify machine learning algorithms which can treat the sensor output to improve accuracy when compared to reference method instruments. The work presented here does not approach the rigour of those large research studies, but it does indicate that a simple linear model which accounts for humidity may help to improve the accuracy of the low-cost sensors. This may be a suitable treatment for the data from low-cost sensors in the context of a non-regulatory citizen science project such as Slow the Smoke.

It was unfortunate that a significant subset of data were lost through the ratification process at Temple Way between 10/05/2022 and 25/07/2022, as this significantly reduced the training data available for one model and therefore its potential performance. It is likely that the performance of the modelling process could be improved with a longer time period and more colocation sites. In particular, it would be advantageous for the study period to capture an entire calendar year in order to have data representative of all the seasons and weather conditions therein.

A greater range of models could be compared, rather than just linear regression and xgboost, and hyperparameter tuning could improve the performance of the machine learning methods. In addition, although the data modelled were time series, the modelling approach did not incorporate any time series forecasting methods. This was partly because of the limited data available, as not all seasonal data were present. Nonetheless, in relation to air quality data, the key influences arise from meteorological conditions and traffic movement (for the sites studied), and this work focused on modelling the relationship between two measured quantities rather than the output of a complex process per se. Hence this quite simple modelling approach seems appropriate for the purpose.