Summative Portfolio: L4
1 Introduction
My name is Steve Crawshaw. I am a project manager for Bristol City Council and I have undertaken the L4 Apprenticeship (Data Analyst) to improve my knowledge and skill in data analytics with a view to undertaking the L7 apprenticeship and achieving a role as a data scientist.
I have worked for Bristol City Council (BCC) since 1998 in essentially the same role, although I was seconded to another organisation between 2013 and 2016. My main role now is managing a network of air quality monitors and the data that they generate. This will be the subject on which I will focus for the summative portfolio.
2 Employer: Bristol City Council
Bristol City Council is a large unitary local authority in the South West of England.
2.1 Goals, Vision and Values
Bristol City Council’s Corporate Strategy outlines a vision of driving an inclusive, sustainable and healthy city of hope and aspiration where everyone can share the city’s success. It also describes the activities required by law.
The Corporate Strategy’s main priorities are informed by 5 key principles.
- Development and delivery
- Environmental sustainability
- Equality and inclusion
- Resilience
- World-class employment
It’s also arranged around 7 main themes:
Children and young people: A city where every child belongs and every child gets the best start in life, whatever circumstances they were born into.
Economy and skills: Economic growth that builds inclusive and resilient communities, decarbonises the city and offers equity of opportunity.
Environment and sustainability: Decarbonise the city, support the recovery of nature and lead a just transition to a low-carbon future.
Health, care and wellbeing: Tackle health inequalities to help people stay healthier and happier throughout their lives.
Homes and communities: Healthy, resilient, and inclusive neighbourhoods with fair access to decent, affordable homes.
Transport and connectivity: A more efficient, sustainable, and inclusive connection of people to people, people to jobs and people to opportunity.
Effective development organisation: From city government to city governance: creating a focussed council that empowers individuals, communities, and partners to flourish and lead.
Bristol City Council is currently a mayoral-led authority. The current Mayor, Marvin Rees, has set out values for the organisation as shown below.
2.2 How BCC Uses Data
BCC is a large and complex organisation dealing with a wide range of functions, from managing highways and planning applications to looking after vulnerable people. It follows that multiple systems and approaches exist for managing data across the organisation, many of which have evolved over time and have not been centrally planned or managed.
A recently published “Data, Insight and Information Strategy” sets out the strategic direction and objectives for the council in this area as follows:
Objective 1 - Create the right culture and environment for great collaboration amongst teams and partners
Objective 2 - Become an insight-rich, data-driven organisation to improve performance
Objective 3 - Manage our information safely, securely and appropriately
Objective 4 - Create a more efficient and effective approach to the use of data
Objective 5 - Establish the supply and use of data as a key enabler of a more connected, smarter city
The strategy aims to deliver these objectives by developing a single analytics team to deliver insights across the organisation and by consolidating the disparate and distributed datasets across the council into a corporate data lake with analysis being done on one data analytics platform (Power BI). Several projects are under way to deliver this change. (K2)
2.3 Data Architecture
No specific document exists describing BCC’s data architecture; however, there is an information asset taxonomy, which is summarised in the diagram below. Only the “people” element is shown. This was developed in 2017 by an officer who has since left BCC, so the information may be somewhat outdated.
Much of this data architecture is not of direct relevance to my work as it pertains to other departments and functions. In my own work area, the main data architecture which I design, maintain and operate is summarised as follows:
- 182 passive samplers (Diffusion tubes) providing monthly concentrations of pollutants (NO2)
- MS Access database (K6) to hold diffusion tube and monitoring network meta data
- A network of 8 real time air quality monitors (analysers)
- A fleet of 4G routers providing telemetry to connect these devices to:
- A proprietary communications software suite (Envista Commcentre)
- A database client and analytics program (Envista ARM) (K6)
- A SQL server database hosted on Azure (K6, S9)
- An open data (OD) platform provided by Opendatasoft (K4)
- FME and FME Server processes to Extract, Transform and Load (ETL) air quality data to the open data platform (K10)
- Dashboards, visualisation and analytics delivered through the OD platform (S14)
- Bespoke reporting and ETL pipelines delivered through R, sourcing data through the OD portal and the Envista database (S14, S10, S2, K14, K7)
2.4 Data Structures
The data used in this portfolio are all structured data, due to the nature of the work. Examples of this structured data are tabular data extracted from the corporate SQL Server database or the MS Access database, and data processed as JSON arrays on the open data portal.
Some data used in the dashboard project could be considered as unstructured, for example the collection of images used to create the animated GIF, however there is an inherent structure to the collection in that the series of images is labelled with a sequential time stamp and relates to a single monitor. These data are really an intermediate step in a process to create a visualisation rather than a source data set.
In the conventional sense, and considering the definition of unstructured data when applied to source data for analysis, unstructured data such as collections of images, multimedia files or emails were not used in the project as they are not relevant in this domain. (K5)
2.5 Security Standards and Policies
The overarching Information Governance Framework outlines roles and responsibilities, policies and procedures, along with best practice and standards for managing the Council’s information assets. This has been developed to take account of the standards set by external organisations, such as the NHS in respect of the transition of Public Health to the Council and the requirements of the Public Sector Network (PSN) Code of Connection (CoCo).
The framework consists of the following areas:
- IGF Principles
- The Information Governance Strategy.
- Appropriate Information Governance Responsibilities.
- Information Asset Ownership
- An Information Governance Structure.
- Effective Information Governance policies and procedures.
- An Information Asset Register
- An Information Risk Register
- Information Governance communication and training.
There are a number of policies and procedures in the framework which deliver the outcomes of the Information Governance Strategy, including:
- Instant Messaging Policy
- Acceptable Use Policy
- Training, Awareness and Development Procedure
- Agile or Teleworking Policy
- Logical Access Control Policy
- Physical Access Control Policy
- Information Security Incident Reporting Policy
- Subject Access Request Policy
All the policies and procedures are hosted on a “metacompliance” platform, which manages access and control of the policies and ensures that relevant staff have read and agreed to the policies.
All my work is compliant with the IGF and I liaise with colleagues in IG where necessary. Because much of my work is with environmental data, the risks of infringing the IGF through personal data issues are limited. (K2)
3 My Role: Air Quality Project Manager
My official job title is “Project Manager”. This is currently under review, partly because the extent of actual project management activity is quite limited. The majority of my time is spent managing a network of air quality monitors and the data that arises from the network.
3.1 Key Deliverables
The deliverables for which I am responsible are summarised as follows:
- Data capture rates exceeding 85% for continuous monitoring, 75% for passive.
- Monthly data cleaning (ratification) of continuous data. (K3, K4, K8)
- Annual reporting of air quality data. (S12, S13, S14, S15)
- Calculation of Key Performance Indicators (KPIs) for air quality.
- Developing or revising KPIs as necessary. (K11, K13)
- Ad hoc analysis and summaries of air quality data to support other programmes.
- Ensuring all relevant air quality data are published on the Open Data Portal. (K4, S12)
- Ensuring all relevant technical guidance is followed in relation to air quality management. (S1, K1)
- Responding to relevant requests from stakeholders for air quality data and analysis. (S7)
- Delivery of specific projects such as “Slow the Smoke” (S12, S13, S14, S15)
- Responding to planning applications with an air quality dimension.
3.2 Key Skills and Knowledge
The key skills and knowledge for my role are as follows:
3.2.1 Knowledge
- A good understanding of the legal framework for air quality management in the UK. (K1)
- Familiarity with the relevant technical guidance on assessing air quality. (K2)
- Knowledge of air quality policy and interactions with other domains like transport.
- An understanding of the development management (land use planning) process in the UK including Environmental Impact Assessment (EIA).
- An understanding of the principles of open data and legal framework for public access to data. (K4)
3.2.2 Skills
- Processing and analysing medium-sized (up to 10 million observations) data sets
- SQL (SQL server, ODSSQL and MS Access) (K6)
- Excel
- HTML, JavaScript and CSS for web development of open data products
- R: packages such as `openair`, `openaq`, `sf`, `timetk`, `fastverse`, `deweather` and `tidyverse` are relevant to the air quality domain
- FME and FME Server for automating web services and data integration
- Time series analysis
- Network telemetry: IP, analogue, wireless, Teltonika RMS
- Technical report writing and comprehension of technical reports relating to the domain
- Communication skills - ability to report technical information to non-specialists
- Project management for small and medium-sized projects
- Technical skills related to installation, maintenance and quality control of air monitoring instruments
- Negotiation skills for contract management and securing outcomes in the planning process
3.3 Strengths and Weaknesses
3.3.1 Strengths
- Long experience in the domain and a good level of skills and knowledge
- Strong work ethic
- Motivation to deliver and improve services and air quality
- Wide range of contacts internally and externally
- Collaborative approach to working
3.3.2 Weaknesses
- Wide, rather than deep data skill set
- Limited understanding of statistical theory and advanced analysis
- No exposure to team environment of other data analysts so lacking peer support
- Isolated from corporate data analytics functions
- Limited employer incentive to improve skills
3.4 Areas for Improvement
- Improved understanding of statistical theory and learning
- Power BI (corporate data analytics tool)
- Azure (corporate data platform)
- Python and pandas for comprehensive tooling of data science operations
4 Portfolio Projects
I have identified three projects which align with business and apprenticeship progress review objectives. They are focussed on creating data products which will enhance business processes and increase confidence in our data and analysis.
My four objectives for the apprenticeship are listed below. Three of the portfolio projects deliver these objectives.
I am able to work on these projects concurrently and I believe they are all achievable by the end of the apprenticeship.
I have created a GitHub repository for all the projects, and other work as part of the apprenticeship. This enables code sharing, version control and backup of work. I have found this very useful in being able to work on projects flexibly.
All the projects have been implemented mainly using R. I selected R as the overarching computing environment because I have greater experience of programming in R. Another key consideration is that R offers a powerful and comprehensive library for the manipulation of air quality data, called `openair`, which is integral to the implementation of aspects of some of the projects; similar functionality is not available in other languages.
4.1 Combining Data in Projects
All the projects require data sets to be combined. The first project requires multiple sources of data to be joined with each other. This is usually achieved using the `dplyr::left_join()` or `dplyr::inner_join()` functions, but the `purrr::map_dfr()` function is used extensively to iteratively stack multiple data sets when they are extracted from databases, or to combine list structures within tibbles. The dashboard project requires meteorological data from the open data portal to be combined with air quality data to generate visualisations such as polar plots. Data are combined by joining on hourly time stamps with `dplyr` join functions.
Because the sources of data used in the project are mostly under my control, the risks of combining are minimal as I understand the data characteristics and processes for sourcing. Selecting the appropriate field to join on and type of join is important to ensure that the resulting data is as expected. (K10)
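To illustrate the joining and stacking pattern described above, the sketch below shows the general shape of these operations. It is illustrative only: the connection object, helper function, site vector, table names and join keys are placeholders rather than the project's actual objects.

```r
# Illustrative sketch only: get_site_extract(), con, sites, met_tbl and
# site_meta_tbl are hypothetical placeholders, not objects from the project.
library(dplyr)
library(purrr)

# Stack the extracts for several monitoring sites into one long tibble
aq_tbl <- map_dfr(sites, \(id) get_site_extract(con, id))

# Join hourly meteorological data onto the air quality data by time stamp
aq_met_tbl <- aq_tbl |>
  left_join(met_tbl, by = "date_time") |>        # keep all air quality rows
  inner_join(site_meta_tbl, by = "site_id")      # keep only rows with known site metadata
```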
4.2 Project 1: A Data Processing Pipeline for the Statutory Reporting of Air Quality Data
4.2.1 Introduction
Bristol City Council monitors air quality across the city and reports data to government under the Local Air Quality Management (LAQM) regime. This is a statutory duty under the Environment Act (1995) for councils to manage air quality to ensure the quality of the air meets legal limits for regulated pollutants.
In addition to this duty, there is also a requirement to provide monitoring and evaluation of the operation of a Clean Air Zone (CAZ) in Bristol. The CAZ is the main mechanism by which traffic pollution (nitrogen dioxide or NO2) will be reduced to comply with the legal limits.
The two reporting regimes are similar, but different in some respects. For LAQM, the reporting is directly to the government and takes the form of an Annual Status Report (ASR) to be submitted by June 30th of each year. In this ASR are summary tables of air quality data from all our monitoring which are required in a format determined by a template document. For CAZ reporting BCC is required to submit a spreadsheet of summary data in a set format to the Institute for Transport Studies at Leeds, who are the body that is conducting research on all the CAZs in England on behalf of the government. (K1, S1)
The aim of this project is to develop a process for reporting for both of these purposes that limits the need for manual data entry.
4.2.2 The LAQM regime
Local authorities are required to review and assess air quality in their areas and report to government annually. Where concentrations of regulated pollutants exceed legal limits “Air Quality Objectives” (AQO), councils must declare an air quality management area (AQMA) and publish a plan for dealing with the breaches of the AQOs.
Several urban local authorities experience breaches of the AQO for nitrogen dioxide, a traffic pollutant. The AQO is set at 40 \(\mu g\,m^{-3}\), measured as an annual mean. In Bristol, this is the only pollutant for which exceedances are measured.
The relevant guidance for LAQM, LAQM.TG(22) prescribes the monitoring regime that must be followed for each pollutant, including how data is monitored, processed and reported. The process used in this project will adhere to the requirements of LAQM.TG(22). (K1, S1)
Further prescriptive guidance is provided by Defra for air quality reporting in the form of templates for the report itself and for the tables that should be included in the report. (K2)
For this project, the data process will focus only on tables prefixed “A” in the spreadsheet referenced. These are the tables that contain monitoring data. Other tables in the report require manual data entry as they need data that is sourced from other officers and departments in the council so will not be addressed in this pipeline.
In addition to these tables, there is a requirement to include thematic maps in the report which show concentrations of pollutants at monitoring sites. The process will also produce spatial datasets which can be used in a GIS to plot the maps or will produce graphical output directly for inclusion in the report as image files.
4.2.3 CAZ Reporting
The purpose of the CAZ is to reduce concentrations of one pollutant (NO2) to legal levels in the shortest time possible. Therefore the reporting is limited to NO2. However, the reporting on this single pollutant requires greater granularity than the LAQM reporting. This is partly because advanced techniques for analysing changes in time series air quality data will be used to assess the impact of a specific policy intervention (the CAZ) and therefore the hourly time series data need to be reported, not just the summary statistics. This analysis will be done by ITS using the `AQEval` R package and the data for Bristol will be supplied by providing access to the open data portal dataset for continuous air quality data. (K11)
The reporting for diffusion tube data consists of providing data in a pre-formatted template quarterly. The data is a combination of metadata for the measurement site, e.g. location, height, distance from a road, and the “raw” monthly concentrations from the diffusion tube. The measurement technique for diffusion tubes and continuous analysers is described later on.
4.2.4 Analysis and Processing Techniques
Summary
The overall process was controlled using the `targets` R package. This is a “Make”-like pipeline tool for statistics and data science in R which helps maintain a reproducible workflow. `targets` learns how the pipeline fits together, skips costly runtime for tasks that are already up to date, runs only the necessary computation, supports implicit parallel computing, abstracts files as R objects, and shows tangible evidence that the results match the underlying code and data.
Each operation was assigned to a function. `targets` ensures that the functions run in the right order, and only when all pre-conditions are satisfied. (K3, K8, K10, K14, S2, S4, S15)
Data Sources
The data needed for the analysis are in five separate repositories:
- A corporate Azure SQL server database (envista) hosting BCC’s continuous air quality data
- A Microsoft Access database for the diffusion tube data and monitoring site meta data
- The council’s open data portal
- UK Government’s Air Information Resource which holds data for the air monitoring sites run by government as part of the Automated Urban and Rural Network (AURN). There are two of these sites in Bristol.
- The LAQM Portal (holds data for the calendar for changing diffusion tubes)
Four of these repositories were used in the analysis. The open data portal was not used as it is planned to cease operation of the contract with the provider in 2023. This is somewhat suboptimal, as the open data portal contains all the relevant data needed for the analysis and the data are easily accessed using a powerful and well documented REST API. The project was reconfigured to avoid using the open data portal when I became aware that it was to be discontinued. (B7, B6, K10, S9)
Data processing overview
The data processing is split into two parts, because the reporting regime requires the use of a standard Excel spreadsheet template for reporting. The first stage generates the input data for the template using a `targets` pipeline. The second stage takes output from the template spreadsheet and generates further files for inclusion in the final report, including charts as images, csv files for updating the open data portal and ESRI shapefiles for mapping in ArcGIS.
The data from databases were accessed using a combination of the `DBI`, `dplyr` and `odbc` packages (K6). Data from the LAQM Portal were retrieved and processed with the `rvest` package for web scraping. For the data hosted on the government’s website, the `openair` package provides a helpful function called `importAURN()`. Functions written using a combination of the `tidyverse` and base R, and controlled by `targets`, implement the ETL processes for stage 1. The dataframes are exported as csv files ready to be pasted directly into the spreadsheet template, or as an Excel spreadsheet with individual worksheets corresponding to the specific tables.
For the generation of spatial output, the `sf` package is used to create shapefiles for use in ArcGIS and to output graphics as png files.
Data Processing Approach
All custom functions are contained within the `functions.R` file, which is called by `_targets.R`. The dates of interest are contained at the top of the file and are set by the user. These are the beginning and end of the calendar year for reporting.
Diffusion Tube Calendar Table
Diffusion tubes are passive samplers that measure air quality (NO2) over a defined period. The period is determined by a calendar published on the internet and this must be adhered to so that data are consistent. The dates for the calendar are extracted using the web scraping library `rvest` in a two-stage process. Firstly, a function scrapes the two html tables from the Defra website.
Secondly, the relevant dates for the year of interest are selected, cleaned and output as a tibble.
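The original scraping functions are shown as screenshots in the report, so the following is a hedged sketch of the two-stage approach just described. The calendar URL argument, the scraped column names and the date format are assumptions, not the pipeline's actual values.

```r
# Sketch only: calendar_url, the column names and the date format are placeholders.
library(rvest)
library(dplyr)
library(janitor)

# Stage 1: scrape the html tables from the Defra diffusion tube calendar page
scrape.dt.calendar.tables <- function(calendar_url) {
  read_html(calendar_url) |>
    html_elements("table") |>
    html_table()                      # returns a list of tibbles, one per html table
}

# Stage 2: keep the dates for the reporting year and return a clean tibble
get.dt.calendar.tbl <- function(tables, report_year) {
  bind_rows(tables) |>
    clean_names() |>
    mutate(across(contains("date"), \(x) as.Date(x, format = "%d/%m/%Y"))) |>
    filter(format(start_date, "%Y") == as.character(report_year))
}
```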
Several utility functions are used throughout the pipeline in other functions. They are not reproduced here, but serve to clean data, provide consistent themes for plots and process dates.
Database Connections and Security
Connections are made to corporate databases to retrieve data. To maintain security and comply with organisational data processing protocols, the connection details are not embedded in code. This would be a security risk if, for example, code were shared on a public GitHub repo. The code below illustrates the use of the `config` package to hold connection credentials in a `config.yml` file, which is referenced by the function and added to a `.gitignore` file so that it is not published online. (K2, K6, S1)
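The production connection function is shown as a screenshot, so the sketch below illustrates the same credential pattern. It assumes a git-ignored `config.yml` with an `envista` entry holding server, database, uid and pwd values; the driver string and field names are assumptions.

```r
# Illustrative sketch of the config.yml credential pattern, not the exact
# production function. Assumes config.yml sits in the project root and is
# listed in .gitignore.
library(DBI)
library(odbc)

connect_envista <- function(config_name = "envista") {
  creds <- config::get(config_name)          # reads the named entry from config.yml
  dbConnect(
    odbc(),
    Driver   = "ODBC Driver 17 for SQL Server",  # assumed driver name
    Server   = creds$server,
    Database = creds$database,
    UID      = creds$uid,
    PWD      = creds$pwd
  )
}
```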
Retrieve Continuous Data
Continuous hourly pollution concentration data from all current sites are retrieved from the envista database using the function below. The function takes the start and end dates supplied and a vector of the site IDs, which are supplied in the targets pipeline. The time base of interest for this process is hourly, but the function can be used to select a different time base, for example 15 minutes. Another tibble, `final_tbl`, is also supplied. This holds metadata about the monitoring sites, for example which channel (column) relates to which pollutant. The function iterates over the list of monitoring sites, collecting the data for the relevant time period, cleaning it, and returning a long-format tibble with the consolidated data. (S4, S2)
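As the actual function is shown as a screenshot, the following is a simplified sketch of the retrieval pattern described above. The table name (`tbl_hourly`) and column names (`siteid`, `date_time`, channel columns) are illustrative, not the real Envista schema; the lazy `tbl()` reference requires `dbplyr`.

```r
# Simplified sketch: schema names are placeholders for the Envista database.
library(dplyr)
library(tidyr)
library(purrr)

get.continuous.tbl <- function(con, start_date, end_date, site_ids, final_tbl) {
  map_dfr(site_ids, function(id) {
    tbl(con, "tbl_hourly") |>                                 # lazy reference via dbplyr
      filter(siteid == !!id,
             between(date_time, !!start_date, !!end_date)) |>
      collect() |>                                            # pull this site's rows into R
      pivot_longer(-c(siteid, date_time),
                   names_to = "channel", values_to = "concentration") |>
      left_join(final_tbl, by = c("siteid", "channel"))       # attach pollutant metadata
  })
}
```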
Data from National Air Monitoring Network
In addition to continuous data from sites operated by Bristol City Council, we also use data from monitoring sites operated by the UK government. These are known as AURN (Automated Urban and Rural Network) sites. The `openair` package is used to retrieve data from these sites using the function shown below. The code first tests whether the data are available via the `test.import.aurn()` function. If so, it retrieves them using the `importAURN()` function. If not, it uses an alternative source, importing by csv in another function (omitted for brevity).
In addition to local monitoring data from government sites, it is also necessary to get background data from sites within 50 miles, which are used to annualise incomplete data from sites where data capture is less than 75%. The function below does this, using the `importMeta()` function to filter sites by latitude and longitude in proximity to Bristol and returning data from these sites as a consolidated tibble. (S6)
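The sketch below shows one way this background step could look, assuming `openair`'s `importMeta()` and `importAURN()` functions. The bounding-box filter, coordinates and thresholds are illustrative rather than the values used in the pipeline.

```r
# Hedged sketch of the AURN background-data step; the crude bounding box and
# its size are assumptions, not the project's actual distance filter.
library(openair)
library(dplyr)
library(purrr)

get.background.aurn <- function(year, lat = 51.45, lon = -2.59, radius_deg = 0.7) {
  meta <- importMeta(source = "aurn") |>
    filter(abs(latitude - lat) < radius_deg,
           abs(longitude - lon) < radius_deg)        # sites roughly near Bristol

  map_dfr(unique(meta$code), possibly(
    \(code) importAURN(site = code, year = year, pollutant = "no2"),
    otherwise = NULL))                               # skip sites that return no data
}
```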
Reshaping Diffusion Tube Data
After diffusion tube data are retrieved (code not shown for brevity) the data are reshaped for later insertion into the output spreadsheet.
During development of this pipeline, a mistake was made in the dates on which diffusion tubes were placed, i.e. they were not placed in accordance with the calendar. I sought guidance from the LAQM Helpdesk (a resource to assist local authorities in managing air quality). They confirmed that we could publish the data but that the incorrectly exposed concentration periods could not be used in the calculations for annual mean values. (B7)
This required adaptation of the code below and of the MS Access database, such that a new field was added to the table to flag whether the data could be used in the analysis. The code below implements that, as noted in the comment. (B5, B6)
I have instructed our member of staff to regularly review the diffusion tube calendar and plan their changeover dates accordingly as a result of this error.
Annualising Incomplete Data
When there is less than 75% data capture for a monitor, the data are adjusted using the background data collected above. To do this it is necessary to know the “background” concentration of NO2 at the site. This is given in terms of a 1 km grid square identifier in the MS Access database. The identifier can then be used to derive a concentration of NO2 in that 1 km square for a given year. The function below receives the database connection, the start date and a vector of site IDs which need annualising. It gets the subset of site IDs which need to be, and can be, annualised and joins tables to return a table of background concentrations for the sites. The annualisation itself is carried out in the pre-formatted reporting spreadsheet. (K10, S4)
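A hedged sketch of this lookup is given below. The table names (`tbl_sites`, `tbl_background`) and columns (`site_id`, `grid_km_id`, `no2_background`, `year`) are placeholders for the MS Access schema, not the real field names.

```r
# Sketch only: schema names are illustrative placeholders.
library(dplyr)

get.background.tbl <- function(access_con, start_date, sites_to_annualise) {
  report_year <- as.integer(format(start_date, "%Y"))

  sites <- tbl(access_con, "tbl_sites") |>
    filter(site_id %in% sites_to_annualise) |>       # only sites needing annualisation
    select(site_id, grid_km_id)

  background <- tbl(access_con, "tbl_background") |>
    filter(year == !!report_year) |>
    select(grid_km_id, no2_background)

  inner_join(sites, background, by = "grid_km_id") |>   # background NO2 per site
    collect()
}
```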
Making Tables for the Reporting Spreadsheet
The reporting spreadsheet for the ASR requires data to be entered in a prescriptive format. The tables are numbered, e.g. “Table A1” to “Table A9”. Not all the code to create the tables is shown, for brevity, but an example is given below. This is quite a simple example: it populates a table of metadata for monitoring sites and includes a function to generate the required character string to describe the monitoring technology used.
Most of the output tables require a descriptive statistical summary in order to compare the measured concentrations with the relevant air quality objective. Examples of air quality objectives are shown below. They are usually an annual mean value or a number of exceedances of a threshold value per year, sometimes expressed as a percentile. These statistics are calculated in the R functions within the pipeline. (K13, S10, K14, S11)
List of Reporting Tables
All the reporting tables are added to a list in the function below, which takes an unspecified number of tibbles, adds them to a list, and removes the “NA”s generated when the contents are turned into character data types. This is so that the reporting spreadsheet does not contain multiple “NA”s.
Write Spreadsheet
The list of tables generated in the function above (and others) is written to the final reporting spreadsheet using the function below. This uses the `write_xlsx()` function to write each item in the supplied list to a separate worksheet in the spreadsheet.
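A minimal sketch of these two steps, building the named list of report tables, blanking out “NA” strings and writing one worksheet per table with `writexl::write_xlsx()`, is shown below. The table names in the usage example are hypothetical.

```r
# Sketch of the list-building and spreadsheet-writing steps described above.
library(dplyr)
library(purrr)
library(writexl)

make.table.list <- function(...) {
  tables <- list(...)                                  # named tibbles, e.g. table_a1 = ...
  map(tables, function(df) {
    mutate(df, across(where(is.character), \(col) if_else(col == "NA", "", col)))
  })
}

write.asr.tables <- function(table_list, path = "asr_tables.xlsx") {
  write_xlsx(table_list, path)   # each named list element becomes a worksheet
}

# Usage (hypothetical table objects):
# write.asr.tables(make.table.list(table_a1 = table_a1_tbl, table_a2 = table_a2_tbl))
```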
Graphical Output
In addition to the data required in the reporting spreadsheet for LAQM, the ASR report also makes use of charts to show, for example, trends in pollutants. The code below is included in the pipeline and iteratively generates charts for pre-determined locations in Bristol. `make.no2.trend.chart.tbl()` takes tables of the annual concentration data for current and previous years, a tibble of the areas to be plotted and the monitors in each area, and the monitor site metadata tibble. It generates plots for each area using `ggplot`, mapping over each area. Each plot is a nested object in the output table, which also includes a column with the file name of the output image.
The function at the bottom of the screenshot takes that tibble and uses the `purrr::pwalk()` function to iteratively save each plot as a graphics file in the path provided.
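The save step could look like the sketch below: walking over the plot and file-name columns of the (hypothetical) chart tibble and writing each plot to disk. Column names and output dimensions are assumptions.

```r
# Sketch of the pwalk() save pattern; chart_tbl$plot and chart_tbl$file_name
# are assumed column names, not the pipeline's actual ones.
library(purrr)
library(ggplot2)

save.trend.charts <- function(chart_tbl, out_path = "plots") {
  pwalk(
    list(plot = chart_tbl$plot, filename = chart_tbl$file_name),
    function(plot, filename) {
      ggsave(filename = filename, plot = plot, path = out_path,
             width = 20, height = 12, units = "cm")
    }
  )
}
```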
An example of a trend chart for an area is shown below. (S14)
Spatial Data Output
Because there is a strong spatial element to air quality data, the ASR contains several maps showing how air quality varies across the city. The data processed in the pipeline are output as a shapefile to support mapping in ArcGIS software. R could be used for mapping, but a higher quality mapping product is generally more easily produced using GIS. However, the source data can easily be generated using the function below.
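A minimal sketch of this spatial export is shown below. The input column names (`easting`, `northing`) and the output file name are assumptions; the coordinate reference system shown is British National Grid.

```r
# Sketch of the shapefile export; column names and file name are placeholders.
library(sf)
library(dplyr)

write.aq.shapefile <- function(annual_means_tbl, out_file = "no2_annual_means.shp") {
  annual_means_tbl |>
    st_as_sf(coords = c("easting", "northing"), crs = 27700) |>  # British National Grid
    st_write(out_file, delete_dsn = TRUE)                        # overwrite any existing file
}
```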
The `targets` Pipeline
All the functions shown (and more) are called and run in a pipeline governed by the `targets` package. This implements a “Reproducible Analytics Pipeline” whereby the dependencies are implicitly defined by the pipeline itself, and redundant code is not re-run. The names of the targets are the objects created, which are hashed and stored in `_targets/objects` as .rds files. Targets which have not changed since the last run will not be re-run. A truncated example of the critical targets pipeline is shown below.
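Since the pipeline itself appears as a screenshot, the following is a truncated, illustrative `_targets.R` in the spirit of that pipeline. The target names and helper functions (e.g. `get.site.meta.tbl()`, `make.table.a1()`) are simplified stand-ins assumed to be defined in `functions.R`, not the real pipeline code.

```r
# Illustrative, truncated _targets.R; helper functions are hypothetical stand-ins.
library(targets)
tar_source("functions.R")                       # load the custom functions

tar_option_set(packages = c("dplyr", "tidyr", "purrr", "readr"))

# Reporting period, set by the user at the top of the file
start_date <- as.Date("2022-01-01")
end_date   <- as.Date("2022-12-31")

list(
  tar_target(site_meta_tbl, get.site.meta.tbl(connect_envista())),
  tar_target(continuous_tbl,
             get.continuous.tbl(connect_envista(), start_date, end_date,
                                site_meta_tbl$siteid, site_meta_tbl)),
  tar_target(table_a1_tbl, make.table.a1(site_meta_tbl)),
  tar_target(table_list, make.table.list(table_a1 = table_a1_tbl)),
  tar_target(asr_file, write.asr.tables(table_list), format = "file")
)
```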
4.2.5 Project 1: Summary of Activities
In Project 1 I carried out the following activities.
- Defined the objectives for the project, namely
- Summary data tables for ASR spreadsheet in specified format
- Appropriate graphic output for ASR (S14)
- Updated spatial data for most recent year’s air quality data
- Summary data for CAZ reporting in specified spreadsheet format
- Developed code and functions using R to deliver objectives
- Adapted code and working methods to accommodate changes in requirements (incorrect exposure periods) (B7, B2)
- Adopted best practice regarding organisational data requirements to secure credentials (S9, B1)
- Investigated and implemented a reproducible analytics pipeline using `targets` (B5)
- Documented the code and analytics approach using literate programming (this Quarto document)
- Implemented version control using Git and GitHub to ensure reproducibility and tracking
4.2.6 Project 1: Learning and Reflection
The process of reporting annual air quality data has changed somewhat over the 25 years I have been doing it. There is a wide range of tasks needed and several different data sets to process. There is a great deal of potential for human error, and I have made mistakes in the past when processing this data. I thought it was a useful project to tackle as part of the L4 apprenticeship because greater rigour in the analysis process and a carefully designed automation process should minimise the errors.
Initially I designed the process to use the open data portal where we publish our air quality data, but as the project developed I became aware that the open data portal was to be decommissioned in 2023. I therefore revised the design to refer directly to the canonical databases to source data. (B6, B7)
I started developing the processing in one R script. This soon became unmanageable so I investigated options to help manage the process. I learned about Reproducible Analytics Pipelines and read the book: Building reproducible analytical pipelines with R. I adopted as many of the approaches recommended in the book as possible, and I feel that this has greatly improved the quality of the code and analysis used in the project. (B5, B6)
Air quality reporting is unfortunately somewhat constrained by the prescriptive and closed requirements of the government department (Defra) which governs the activity in local government. Because of this it is quite difficult for local authorities to share their data in a coherent and consistent way such that a single view of air quality across the UK is possible. If open data approaches were enabled for local authorities by Defra it would enhance understanding of air quality for the public and increase access for researchers.
There are undoubtedly improvements that I could make if I had more time. I would like to do more unit testing of the functions used and ideally build an R package to implement the functions I have developed. I would also like to improve the data management operations upstream of the reporting process, for example the database used to store diffusion tubes and monitoring site meta data is currently MS Access (2010). Moving this to the corporate data lake or a data warehouse would be more robust and enable integration with other data sets in the council. (B6)
Even though I have designed the analysis to be reproducible and accessible, I acknowledge that running the analysis requires a working to intermediate level of expertise with R, and domain specific knowledge. This is potentially a problem in the organisation as I am one of the only people in the council with expertise in R. Hence there is a risk to business continuity with only one person able to run the analysis (safely).
The council is centralising data engineering services, but this function is likely to be focussed on generic business intelligence reporting, not specific “niche” domains such as air quality. Hence it is likely that air quality reporting will remain in a specific team. I am training two colleagues in using R and specialist packages for air quality reporting to help mitigate the risk to business continuity of having only a single practitioner.
4.2.7 Project 1: Business Benefits
Project 1 clearly benefits the council’s air quality team by standardising and automating the statutory reporting procedures for air quality monitoring. Rather than having to laboriously manually collate data from disparate data sources and summarise in a tool such as Excel, the entire process can run reproducibly and efficiently with two processes in R.
4.4 Project 3: Comparing Performance of Low-cost Sensors with Reference Method Instruments
4.4.1 Introduction
The Slow the Smoke Project is introduced in Project 2. It includes a significant “citizen science” component.
The citizen scientists have each been given a “low-cost” air sensor that monitors particulate matter (PM). PM is fine dust in the air, including smoke. There are two fractions of PM that are important for health: PM10 (aerodynamic diameter < 10 μm) and PM2.5 (aerodynamic diameter < 2.5 μm). The devices deployed monitor both of these fractions using a light-scattering technique, in which the scattering of laser light is a function of the concentration of PM in the sampled air.
Because these devices do not directly measure the concentration of PM, but use a proxy measure, they are not as accurate as the “reference” measuring instruments used to assess compliance with air quality objectives. It is therefore necessary to attempt to characterise the performance of these devices in relation to approved or “reference method” devices by co-locating the low-cost sensors with reference method instruments and comparing measurements. This was a stated aim in the grant application for the Slow the Smoke project:
In addition to the community sensors, we will install a reference/equivalence method instrument in the study area to validate measurements through co-location with one or more Luftdaten devices.
The installation of the reference method equipment is ongoing, but co-location was done with existing reference method sites.
The characterisation of the performance of the low-cost sensors was extended through predictive analytics: a model was developed to predict concentrations of particulate matter from the low-cost sensor signals to an accuracy approaching that achieved by reference method instruments.
In the field of air quality, predictive and prescriptive modelling is usually carried out using dispersion models, which are based on a Gaussian distribution of pollution within a plume. They also often include algorithms for calculating the effects of atmospheric chemistry for pollutants such as ozone and oxides of nitrogen. Source terms in the model are meteorological data and activity data such as industrial processes or pollutant emissions from vehicle movements. I have used these models in the past to predict concentrations of regulated pollutants across the city, and to test interventions such as a low emissions zone. (K14, S11)
The modelling in this project was not intended to predict concentrations from source terms such as activity and meteorology, but to predict more accurate concentrations of particulate matter than are available from an untreated low-cost sensor.
4.4.2 Co-location Study
There are three monitoring sites which measure PM in Bristol. A map and summary data for the candidate sites are shown below.
Co-location: Potential Sites

| Site | Site description | Pollutants |
|---|---|---|
| AURN St Pauls - 452 | The monitoring station is located within a self-contained, air-conditioned unit within the car park of a day nursery. The monitoring station is approximately 30 metres south-east of Wilder Street, which is a lightly trafficked urban back street. The surrounding area is primarily residential with some commercial premises in the immediate vicinity. | PM2.5, PM10 |
| Parson Street School - 215 | The Parson Street School monitoring site has been operating since 2002 and was selected as a roadside site that represents residential exposure to relatively high concentrations of traffic-generated pollution. The sample inlet is approximately 1 metre from ground level and four metres from the kerb of Bedminster Road, where traffic queues for the traffic lights. | PM2.5 |
| Temple Way - 500 | Roadside monitoring station beside a multi-lane section of the A4044 (Temple Way). This is a major route in and out of Bristol city centre. | PM10 |
The co-location study used two of these sites: Parson Street School (site 215) and Temple Way (site 500). Both sites are operated by the council. Using the AURN St Pauls site would have required permission from the Environment Agency, which would have been a time-consuming and uncertain process.
The airrohr SDS011 fine dust sensors require a wifi signal in order to push data to a server. The Bristol City Council sites did not previously have wifi access available. The telemetry at our air monitoring sites was a combination of 3G modems and analogue land lines in 2021.
In order to accommodate the co-location study, and also for the purposes of virtualising our data collection machine, I procured, configured and installed Teltonika RUT950 or RUT955 4G LTE routers at all of our monitoring sites. This required me to learn about network protocols, network security and the configuration of Windows updates. (B7)
I liaised with our network team to configure whitelists for the Envista ARM application to access the routers. This enabled 4G TCP/IP access to the data loggers or instruments at all of the sites, and also provided a wifi hotspot to enable the SDS011 sensors to send data. (B7)
The physical installation of the SDS011 sensors at both sites was complete by early May 2022.
Colocation Study: Concept and Objective
The aim of the colocation study is to compare the performance of the low-cost sensors with that of reference method instruments measuring the same pollutant. The method of implementing this comparison was to collect hourly data for the two co-located devices and establish the linearity of the response, using a model to report coefficients and R².
In addition to simply describing the relationship between the low-cost and reference devices, the model was used to predict reference method values (i.e. “real” concentrations) from low-cost sensor readings.
Some of the methods for testing the performance of the sensors have been adopted from the United States Environmental Protection Agency (USEPA) report on the Performance Testing Protocols, Metrics and Target Values for Fine Particulate Matter Air Sensors.
Within this study it was not possible to compare the responses of multiple low-cost sensors with each other, as there was no budget to purchase additional devices for this purpose.
Reference Method Equipment
Continuous Ambient Air Quality Monitoring Systems (CAMS) are certified by the Environment Agency under their MCERTS scheme. This certifies the quality regimes and equipment for environmental permit holders. Local authorities are required to use MCERTS (or equivalent) equipment for monitoring air quality under the LAQM (Local Air Quality Management) regime. The certification and approval process ensures that the measurement standard provides data of an acceptable quality.
In addition to the specification of the equipment, a local authority is required to adhere to the calibration and maintenance requirements for the equipment as set out in the relevant guidance LAQM.TG(16) and the Local Site Operator (LSO) manual for sites that are part of, or affiliated to the national AURN (Automated Urban and Rural Network).
(K2, K8, K9, S1)
The equipment used in this colocation study is the Met One BAM 1020 Continuous Particulate Monitor, hereafter referred to as “BAM 1020”. The instrument works by drawing a sample of air through a filter tape every hour. The deposited PM is then exposed to a source of radioactive Carbon-14 on one side of the filter tape. A beta radiation detector on the other side of the tape measures the attenuation of the beta radiation through the sampled filter. The attenuation of the beta radiation is a function of the deposited PM mass on the filter tape. Because the flow rate of the sampled air is known, the concentration in \(\mu g\,m^{-3}\) can be calculated. Hourly concentrations are recorded on an external data logger and polled by a central telemetry system which holds the data in a SQL Server corporate database. (K6)
Colocation Sites: Temple Way
The Temple Way site is affiliated to the national monitoring network. PM10 and NOx are measured at this site; summary metadata are provided below.
Colocation Sites: Parson Street School
Oxides of Nitrogen (NOx) have been measured at Parson Street School for many years. The enclosure is close to the roadside of a busy, queuing road and represents exposure of school children and school staff. In recognition of the need to understand exposure to PM2.5 at a roadside site the monitoring station was updated with a BAM 1020 in 2021. The BAM 1020 when configured for monitoring PM2.5 includes an additional particle size cut off filter and also incorporates a heated inlet to drive off volatile compounds from the sampled air. The summary metadata for the site is shown below.
Colocation Configuration
The low-cost sensors were co-located inside the cages of the monitoring sites. The Parson Street sensor was installed on 29th March 2022 and the Bristol Temple Way sensor on 1st May 2022. The photographs below show the detail of the co-located devices at each monitoring site.
Data Sources
Data from one low-cost sensor and from both reference instruments are published in near real time on the council’s open data portal. The BAM 1020 data are available through the `air-quality-data-continuous` dataset and the Parson Street data are available through the `luftdaten_pm_bristol` dataset.
The latter is a geographical subset of the sensor.community archive, focussed on Bristol; in addition, data are aggregated to give an hourly mean value for both PM10 and PM2.5. Custom functions were written in R to access these files and import the data as data frames. The data for the low-cost sensor at Temple Way are not published on the council’s open data portal because the sensor did not register properly on the sensor.community portal. The data are, however, available through a combination of online .csv and .zip files. (K10, B7)
Data Processing Pipeline
Initially the data processing pipeline was split into three R scripts: one for retrieving and preparing the data, one for plotting, and one for modelling. However, while developing this approach, it became apparent that there were significant problems with this method. Tracking the operations across three different scripts became confusing, and testing and debugging were sub-optimal, resulting in errors when the scripts were not called in order. Researching this issue identified a possible solution, which was eventually used for this project (B5, B7). The `targets` R package is a “Make”-like pipeline tool for data science in R. It helps to maintain a reproducible workflow, minimises repetition, runs only necessary code, and forces a functional approach to code and pipeline development. (B7, B6, S15, K11)
One advantage of using `targets` is that the entire processing pipeline can be viewed as an interactive graph using the `tar_visnetwork()` function. This enables a quick overview of functions, targets etc. and helps understanding of the process. An extract from the pipeline visualisation is shown below, highlighting the model development part of the pipeline.
Retrieving and Preparing data for Modelling
Data for the two BAM 1020 instruments and the Parson Street low-cost sensor are retrieved using a purpose-built R function which uses endpoints from the Opendatasoft API. This is sourced from a script outside the project and is shown below.
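The actual retrieval function is shown as a screenshot, so the sketch below only illustrates the general shape of pulling a dataset export from an Opendatasoft portal. The domain, endpoint path, query parameters and delimiter are assumptions for the sketch, not the exact endpoint or function used in the project.

```r
# Hedged sketch of an Opendatasoft export request; endpoint details are assumptions.
library(httr)
library(readr)

get.ods.dataset <- function(dataset_id,
                            domain = "opendata.bristol.gov.uk",
                            where  = NULL) {
  url <- sprintf("https://%s/api/v2/catalog/datasets/%s/exports/csv",
                 domain, dataset_id)
  resp <- GET(url, query = list(where = where))
  stop_for_status(resp)
  read_delim(content(resp, as = "text", encoding = "UTF-8"), delim = ";")
}

# Usage (hypothetical filter expression):
# bam_tbl <- get.ods.dataset("air-quality-data-continuous",
#                            where = "siteid = 500")
```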
For the Temple Way sensor a different approach was developed to retrieve the data, which are available through the Madavi API as a combination of csv and zip files. This operation entails retrieving a number of files in different formats and joining the data into one data frame. The function which retrieves the monthly .zip file URLs is shown below.
Once data are retrieved from the online repositories, they are combined into a nested data frame or “tibble” with functions from the `purrr` package, such that columns are created that can be used in the subsequent modelling process. The resulting nested tibble is grouped by the site ID, with list columns containing the raw hourly data, the prepared source data for the model, and a pivoted (wide) data set used in the plotting and modelling functions.
Nesting the model data enables a concise, iterative approach to developing the modelling pipeline and means that the modelling process itself generates a data frame with list columns that contain the output of the regression model, as well as plots which explain the data and model output.
Although data from the reference method instruments are collected at hourly frequency, and the data from the low-cost sensors are collected at five-minute frequency (then downsampled to hourly), it was decided that the modelling exercise would use daily data. This is to reduce the large variation in hourly data and also to reflect the legal objectives for these pollutants, which are largely set in relation to daily rather than hourly periods.
Plotting
Functions were developed to create a range of visualisations of both the raw and processed data and model outputs. Not all function code is shown here, but some examples are presented.
1. `plot.drift.site.gg()` is a function based on the `ggplot2` library which takes the prepared model data tibble and a site ID and returns a `ggplot` object (plot) showing how the pollutant concentrations from the low-cost sensor diverge from the reference method measurements over time. The plot uses the `geom_smooth()` geom with the smoothing set to “gam” in order to show a smoothed trace (a sketch of this approach is given after this list).
2. `plot.scatter.site.gg()` is another ggplot-based function which generates a scatter plot for co-located instruments. It also incorporates functions from the `ggside` library (an extension for ggplot2) which enables side-plots for the scatter plot showing the distribution of data points for each axis. Further, it includes `stat_cor()` and `stat_regline_equation()` from the `ggpubr` library to show correlation statistics and the equation derived from the linear model on the chart. This function therefore provides much useful information about the relationship between the two data sets on a single chart.
3. `prep_timeplot()` is a helper function that takes the prepared data for modelling and processes it for use in another function. The function subsets the data, renames columns, pivots longer and pads the time series such that the data are ready for the `plot_time_series()` function.
4. `plot_time_series()` takes the data from step 3 above and a string representing an interval (e.g. “hour”) and plots a time series chart using the `ggplot2` library.
5. Other helper functions format and save plots.
Exploratory Data Analysis
The co-located data were first plotted using the `summaryPlot()` function from the `openair` package. This gives a schematic overview of the time series, distribution, and summary statistics for the data. It can be immediately seen that the distributions differ, with the low-cost sensor data being markedly more right-skewed, but with lower overall concentrations, than the reference data (K13, S10). It is also notable that significant quantities of data are missing from the reference PM10 instrument at Temple Way. This is because data were removed by the Environment Agency due to data quality issues. This is unfortunately likely to affect the performance of future modelling (K8, S6).
Further exploratory data analysis (EDA) with time series charts showed that the data from the low-cost sensors diverged significantly from the reference instruments. The plot below shows daily mean concentrations at both sites. (K8, S6)
The divergence does not appear to be uniform or consistent and seems to drift over the study period, so drift plots were developed to examine this artefact. The drift is significant, and differs at each site and for each pollutant.
Scatter Plot and Linear Regression
Scatter plots were made of the co-located data at each site to visualise the hourly data and distributions alongside each other. Model coefficients are displayed on the charts.
Modelling
The relationship between the low-cost and reference sensors should ideally be linear. It is therefore reasonable to identify a linear regression model as appropriate for this task. However, other models may also be useful, so in addition to the linear model, an xgboost model was also deployed for comparison with the simple linear model.
The EPA report indicates that the low-cost sensors may be affected by ambient humidity levels. This is also borne out by research by Liu et al (2018). Hence it was decided to include humidity as a term in the model selection process, as well as temperature, another available variable that may affect the relationship.
In order to compare the two models (xgboost and linear regression) using combinations of three regressors (low-cost readings, temperature and humidity), a function was written to (a sketch of this comparison is given after this list):

- Split the data into training and test sets using the `rsample::initial_time_split()` function
- Run the selected combinations of models and regressors on the training split and store them in a list
- Using the model metrics functions in the `yardstick` package, make a table of model metrics to judge the performance of the model / regressor combinations based on their performance when applied to an unseen testing split
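A hedged sketch of this comparison step, using the tidymodels framework, is shown below. The input tibble (`daily_tbl`) and its columns (`reference`, `low_cost`, `humidity`, `temperature`) are placeholders for the real prepared data, and the split proportion is an assumption.

```r
# Sketch of the model / regressor comparison; data and column names are placeholders.
library(rsample)
library(parsnip)
library(yardstick)
library(dplyr)
library(tidyr)
library(purrr)

compare.models <- function(daily_tbl) {
  split <- initial_time_split(daily_tbl, prop = 0.75)
  train <- training(split)
  test  <- testing(split)

  specs <- list(
    lm  = linear_reg() |> set_engine("lm"),
    xgb = boost_tree(mode = "regression") |> set_engine("xgboost")
  )
  formulas <- list(
    lowcost          = reference ~ low_cost,
    lowcost_humidity = reference ~ low_cost + humidity,
    lowcost_hum_temp = reference ~ low_cost + humidity + temperature
  )

  # Fit every model / regressor combination and score it on the unseen test split
  combos <- expand_grid(spec = names(specs), form = names(formulas))
  pmap_dfr(combos, function(spec, form) {
    fitted <- fit(specs[[spec]], formulas[[form]], data = train)
    predict(fitted, test) |>
      bind_cols(test) |>
      metrics(truth = reference, estimate = .pred) |>   # rmse, rsq, mae
      mutate(model = spec, regressors = form)
  })
}
```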
The results of this analysis are shown below.
It can be seen that the model and regressor combination that minimised RMSE and maximised R2 for the PM2.5 instruments was the linear model (lm) using the low-cost data and humidity as predictors; including temperature reduced MAE to a minimum. For PM10, the combination of the linear model and low-cost + humidity reduced all error metrics to minima and maximised R2 when compared to all other options. Hence it was decided that, to maximise model performance, the optimal modelling approach was to use a linear model with the low-cost readings and humidity as predictors for the reference signal. It may have been possible to improve the performance of the xgboost model further by parameter tuning but, for the purposes of this exercise, the linear model seemed satisfactory.
The model data tibble previously created is used as input to a function which runs the linear model on the data for each site. A script containing custom themes for plots is sourced at the top of the `_targets.R` file. The `make.selected.model.output.tbl()` function takes the data tibble and creates a new output tibble with the following features (a condensed sketch of this pattern is given after the list):
- Model input data (time stamped daily concentrations and humidity data for entire period)
- A temporal split object for training and testing the model
- Model objects for the training and full data
- Tidied datasets (estimates, standard error and p.values) for training and full data
- Glanced datasets (r squared, p.value and number of observations) from training and full data
- Augmented data - combined observed test data and predicted (fitted) data from the trained model
- A check_model object, visually summarising the model’s performance for training and full data
- Prediction plots showing the predicted daily values from the model for the test split using the augmented data
- A performance summary table for the training and full data
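The sketch below condenses this nested-modelling pattern, using `broom` to tidy, glance and augment each site's linear model. The list-column names (`model_data`) and model variables (`reference`, `low_cost`, `humidity`) are assumptions, and the plot and `check_model()` columns are omitted for brevity.

```r
# Condensed sketch of the nested modelling tibble; column names are placeholders.
library(dplyr)
library(purrr)
library(broom)
library(rsample)

make.selected.model.output.tbl <- function(model_data_tbl) {
  model_data_tbl |>
    mutate(
      split     = map(model_data, initial_time_split, prop = 0.75),
      model_trn = map(split, \(s) lm(reference ~ low_cost + humidity, data = training(s))),
      model_all = map(model_data, \(d) lm(reference ~ low_cost + humidity, data = d)),
      tidied    = map(model_all, tidy),        # estimates, standard errors, p-values
      glanced   = map(model_all, glance),      # r.squared, p.value, number of observations
      augmented = map2(model_trn, split,       # predictions on the unseen test split
                       \(m, s) augment(m, newdata = testing(s)))
    )
}
```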
4.4.3 Results
Model Check plots
The `easystats` metapackage in R contains a package called `performance` which includes the `check_model()` function. This provides a comprehensive set of plots in a single output, enabling inspection of the model’s performance. The model check function was run on linear models fitted to both the training data and the full dataset. Results are shown below for the full model run for each site.
Inspecting the plots, it can be seen that the models perform reasonably well according to the analyses in the `check_model()` function. The linearity of residuals is good until very high concentrations are reached. This also holds for the homogeneity of variance. Collinearity of predictors is low for both models. Normality of residuals is good for both sites, although at site 500 there is an interesting artefact in the first positive standard deviation, where residuals start to diverge slightly from normality. Overall it seems that the linear model performs well at both sites.
(K14, S11)
Summary Tables
The performance metrics for each model are extracted and tabulated in a range of formats for display. The `gtsummary` package produces a nicely formatted html table, and the `performance` package from the `easystats` suite of statistical tools is used to generate summary tables.
Model Coefficients
The model coefficients are shown in the tables below for the models fitted to the training data and to the full datasets. Note that the coefficients relate to both predictors (low-cost and humidity) and therefore differ from the coefficients shown on the exploratory scatter plots, which include just the low-cost term.
Model Performance
The key metrics for the models are shown in the table below. (K14, S11)
Model Predictions
The models can be used to predict data in the test split. The fitted data are compared with the observed (reference) data in the test set using a time series plot of the daily mean concentrations. The test set is isolated from the training data, so this is a true test of the model on “unseen” data.
It can be seen that the model predictions (.fitted) are quite close to the observed reference data at both sites. However, the modelled concentrations are not acceptable in a regulatory application such as air quality management. (K14, S11)
4.4.4 Project 3: Summary of Activities
The activities I undertook in Project 3 were as follows.
- Planned an environmental monitoring exercise to evaluate data quality from low-cost sensors as part of a funded project
- Installed sensors and configured telemetry at two sites, liaising with relevant stakeholders to secure permissions
- Developed a reproducible analytical pipeline (RAP) in R using the `targets` package to implement the following processes:
  - Get and combine data from a range of sources using R functions
  - Conduct EDA to understand the data in the context of the domain
  - Use visualisations to understand the relationships within the data
  - Test a range of predictive analytics approaches and select an appropriate model (K14, S11)
  - Fit and evaluate the model on training / testing data sets
  - Finalise the model and visualise predictions
- Critically evaluated the approach and suggested improvements
4.4.5 Project 3: Learning and Reflection
This project provided me with my first “real-world” exposure to predictive analytics using regression and machine learning models. The growth of low-cost instruments for sensing air quality in recent years means that there is a significant need to understand their performance when compared to high-quality instruments, and this project has enabled me to gain a strong understanding of this and to be able to speak with confidence on this issue when advising stakeholders.
The logistical challenges associated with implementing the monitoring campaign are significant. Data capture was an issue, both for the low-cost sensors and for the reference instruments. The quality of the study could have been significantly improved if the monitoring had been done for at least a calendar year, as the seasonal effect could then have been modelled. In addition, a greater range of regression models could have been tested to potentially improve the performance of the models. (K14, S11)
If the project were extended I would like to develop the modelling approach into a web application, to provide citizen scientists with a tool to adjust data from their sensors to show a “real” (reference) concentration. This would extend the work we have done in our team to explain the performance of low-cost sensors and guide potential users of this technology.
In this project I used a combination of the `tidymodels` and `easystats` metapackages in R. I feel that I did not fully explore or understand the optimal workflows for modelling with these packages, and there is certainly potential to improve my understanding of predictive analytics processes. (S13)
I again used `targets` to implement the RAP, and I think this was helpful in iterating the project. I have developed a workflow and conventions that seem to work well and should enable others to revisit the work and understand the code.
I could have used Python for this project, but I feel that I was able to make progress more quickly with R as I am familiar with it.
4.4.6 Project 3: Business Benefits
The co-location element of the Slow the Smoke project was a requirement of the grant funding, and the predictive analysis of the data summarised in this project has been included in the final report to the funding body (Defra). In addition to discharging our requirements for the project, the work has reinforced understanding of the performance of low-cost sensors measuring particulate matter. As a team, we are often approached by manufacturers and suppliers to use or endorse these products, so a clear understanding of performance helps us to evaluate these approaches.
4.4.7 Conclusions
This exercise used a simple linear model to establish the relationship between a reference method instrument and a low-cost sensor for two different measures of particulate pollution (PM10 and PM2.5) at two different sites in Bristol. Humidity was identified as an important term to include in the model, and helped to improve model performance. Coefficients for the models differed depending on the pollutant and site. Nevertheless, the models performed reasonably well, given the limited availability of data. (K14, S11)
The use of low-cost sensors in air quality measurement is growing, and large research projects have been established to assess the performance of the sensors and identify machine learning algorithms which can treat the sensor output to improve accuracy when compared to reference method instruments. The work presented here does not approach the rigour of those large research studies, but it does indicate that a simple linear model which accounts for humidity may help to improve the accuracy of the low-cost sensors. This may be a suitable treatment for the data from low-cost sensors in the context of a non-regulatory citizen science project such as Slow the Smoke.
It was unfortunate that a significant subset of data were lost through a ratification process at Temple Way from 10/05/2022 to 25/07/2022 as this significantly reduced the training data available for one model and therefore its potential performance. It is likely that the performance of the modelling process could be improved with a longer time period and more colocation sites. In particular it would be advantageous for the study period to capture an entire calendar year in order to have data representative for all the seasons and weather conditions therein.
A greater range of models could be compared rather than just linear regression and xgboost, and hyperparameter tuning could improve the performance of the machine learning methods. In addition, although the data modelled were time series, the modelling approach did not incorporate any time series forecasting methods. This was partly because of the limited data available, as not all seasonal data were available. Nonetheless, in relation to air quality data, the key influences arise from meteorological conditions and traffic movement (for the sites studied), and this work focused on modelling the relationship between two measured quantities rather than the output of a complex process per se. Hence this quite simple modelling approach seems appropriate for the purpose.