Corralling Machine Learning for Service Line Inventory Development

Joanna Cummings Mark Zito

Learn about applying machine learning and predictive modeling to develop service line material inventories in our Corralling Machine Learning for Inventory Development webinar. CDM Smith’s LCR Compliance Coordinator Joanna Cummings, along with Trinnex’s leadCAST™ product manager Mark Zito, provides essential guidance in defining the rules of the road for machine learning.

Machine learning and other verification methods

When developing a service line inventory, no verification method identifies pipe materials with 100% certainty, yet many of these methods are accepted by the EPA and most states. At an achievable accuracy rate of 80% or higher, machine learning, or predictive modeling, can be as accurate as, or more accurate than, other verification methods when reined in and used correctly. Developing machine learning and using it effectively can yield tremendous cost benefits to water utilities working on their inventories.

Machine learning is used to predict whether a service line is highly likely to be lead, based on historical data and other defined characteristics assumed to indicate lead material, such as year of installation on the utility side, home construction date beyond a certain year, pipe diameter of a certain size, or other neighborhood and economic characteristics. The model correlates these characteristics with known materials from either targeted or random inspections. This process is then repeated to increase the accuracy of the model.

Machine learning is an iterative process and is best suited for use alongside ongoing field verifications. There are two primary uses for machine learning. The first is planning and prioritization, which can be done in the initial stages of the iterative process. The initial model is developed from any available material verifications, which can come from previous field work such as water main or meter replacement programs. The initial model can support utilities in prioritizing field inspections, estimating replacement costs, and sanity checking confidence in historical records.
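As a rough illustration of the planning and prioritization use, the sketch below ranks unverified services by a predicted probability of lead and estimates how many lead lines to expect. The field names (`p_lead`, `verified_material`), the sample records, and the unit replacement cost are all hypothetical and are not taken from any particular product.

```python
def prioritize(services, budget):
    """Return up to `budget` unverified services, highest predicted lead probability first."""
    unknown = [s for s in services if s["verified_material"] is None]
    return sorted(unknown, key=lambda s: s["p_lead"], reverse=True)[:budget]


def expected_lead_count(services):
    """Expected number of lead lines among unverified services (sum of probabilities)."""
    return sum(s["p_lead"] for s in services if s["verified_material"] is None)


# Hypothetical inventory records; p_lead would come from an initial model.
services = [
    {"id": "A", "p_lead": 0.10, "verified_material": None},
    {"id": "B", "p_lead": 0.85, "verified_material": None},
    {"id": "C", "p_lead": 0.60, "verified_material": "copper"},  # already verified
    {"id": "D", "p_lead": 0.40, "verified_material": None},
]

to_inspect = prioritize(services, budget=2)
print([s["id"] for s in to_inspect])  # ['B', 'D']

UNIT_REPLACEMENT_COST = 8000  # hypothetical cost per replacement, for illustration only
print(round(expected_lead_count(services) * UNIT_REPLACEMENT_COST))
```

Summing probabilities gives an expected count rather than a list of specific addresses, which is why it suits budgeting while the ranked list suits scheduling field crews.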

After the initial model is built, it needs to be periodically updated with new data as more field verifications are completed. This is when the model’s hit rate and accuracy should improve. The second use comes toward the end of the iterative process: presuming the model has been enriched with new field verifications over time and its accuracy is comparable to other verification methods, we can consider using it to determine service line material in the inventory. It’s important to note that the EPA cites machine learning as a method to assign likelihood of lead to unknown services, but ultimately leaves the acceptance of machine learning as a verification method up to individual states. Check to see what your state permits.
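One way to watch the model improve across iterations is to compute a hit rate after each round of field work. The sketch below assumes "hit rate" means the share of lines predicted to be lead that crews then confirmed as lead. That definition and the sample inspection results are assumptions made for illustration.

```python
def hit_rate(inspections):
    """Fraction of predicted-lead lines that field crews confirmed as lead."""
    flagged = [i for i in inspections if i["predicted"] == "lead"]
    if not flagged:
        return None  # nothing was predicted lead in this round
    return sum(i["found"] == "lead" for i in flagged) / len(flagged)


# Hypothetical results from two rounds of targeted field verification.
rounds = {
    "round 1": [
        {"predicted": "lead", "found": "lead"},
        {"predicted": "lead", "found": "copper"},
        {"predicted": "lead", "found": "copper"},
        {"predicted": "copper", "found": "copper"},
    ],
    "round 2": [
        {"predicted": "lead", "found": "lead"},
        {"predicted": "lead", "found": "lead"},
        {"predicted": "lead", "found": "copper"},
        {"predicted": "copper", "found": "copper"},
    ],
}

for name, inspections in rounds.items():
    print(name, hit_rate(inspections))
```

In this made-up example the hit rate rises from one in three to two in three between rounds, which is the kind of trend a utility would look for before relying on the model for inventory classifications.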

Leveraging machine learning

Machine learning is a subset of artificial intelligence (AI) that focuses on developing models through learned patterns in data. The model doesn’t rely on explicitly programmed rules; it uses training data to identify patterns and then applies them to flag possible lead service line (LSL) locations and their probability of containing lead. Machine learning can improve upon historical data, expand the impact of field verification methods, estimate where lead service lines are located, and decrease costs with targeted field verifications.

Machine learning models, like Trinnex’s leadCAST™ Predict, can be trained and tested by removing roughly 20% of the services with known material (the “test set”) from the model's view, then training the model with the remaining 80% of services with known material (the “training set”) to predict the materials of the held-out 20%. By comparing predicted material to actual material for the records in the test set, we can assess how the model is performing. The initial results of the model test run can be used to identify targeted field verifications. The findings from these targeted field verifications are then fed back into the model through multiple iterations. As a result, the accuracy of the model should improve over time.
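The 80/20 holdout test described above can be sketched in a few lines of standard-library Python. This is not how leadCAST Predict is implemented; the "model" here is a deliberately simple stand-in (the most common verified material per decade of construction), and the records, field names, and seed are hypothetical.

```python
import random
from collections import Counter, defaultdict


def holdout_split(records, test_fraction=0.2, seed=0):
    """Shuffle verified records and hide `test_fraction` of them from training."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def decade(year):
    return year // 10 * 10


def train(training_set):
    """Toy model: predict the most common verified material for each decade built."""
    by_decade = defaultdict(Counter)
    overall = Counter()
    for rec in training_set:
        by_decade[decade(rec["year_built"])][rec["material"]] += 1
        overall[rec["material"]] += 1
    default = overall.most_common(1)[0][0]  # fallback for decades never seen
    lookup = {d: c.most_common(1)[0][0] for d, c in by_decade.items()}
    return lambda rec: lookup.get(decade(rec["year_built"]), default)


def accuracy(predict, test_set):
    """Share of held-out records whose predicted material matches the verified one."""
    hits = sum(predict(rec) == rec["material"] for rec in test_set)
    return hits / len(test_set)


# Hypothetical verified services: older homes lead, newer homes copper.
verified = (
    [{"year_built": 1920 + i, "material": "lead"} for i in range(30)]
    + [{"year_built": 1960 + i, "material": "copper"} for i in range(30)]
)

training_set, test_set = holdout_split(verified)
model = train(training_set)
print(len(training_set), len(test_set))  # 48 12
print(accuracy(model, test_set))
```

Because the test records are never shown to the model during training, the accuracy figure estimates how the model would perform on services whose material is still unknown.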

Keep in mind, machine learning models must be trained for each utility; they are not universal. Utility A may have banned lead in 1950, whereas Utility B enacted the federal ban in 1988. Additionally, if there is no lead in a training/test data set, traditional two-class classification models will not be able to determine features related to lead or make accurate predictions for lead services. Utilities can overcome this by training on the second-worst material in the service lateral, such as galvanized. Training the model with specific housing data in the area, instead of general census data, which is aggregated and not house-specific, will also yield more accurate results.
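The fallback to a second-worst material can be expressed as a small check on the verified records before training. In this sketch the preference order, the minimum count, and the sample materials are assumptions; a real project would likely require many more than one verified example of the target material.

```python
from collections import Counter


def choose_target(verified_materials, preferred=("lead", "galvanized"), min_count=1):
    """Pick the first material in `preferred` with at least `min_count` verified examples."""
    counts = Counter(verified_materials)
    for material in preferred:
        if counts[material] >= min_count:
            return material
    return None  # no usable positive class; a two-class model cannot be trained


def binary_labels(verified_materials, target):
    """Turn materials into 1 (target material) / 0 (anything else) for a two-class model."""
    return [1 if m == target else 0 for m in verified_materials]


# Hypothetical verified set with no lead at all.
materials = ["copper", "galvanized", "copper", "plastic", "galvanized"]

target = choose_target(materials)
print(target)  # galvanized
print(binary_labels(materials, target))  # [0, 1, 0, 0, 1]
```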

How can utilities get started?

The initial steps of developing an inventory and of using machine learning are similar, so utilities do not need to fully commit to machine learning to get started.

Inputting third-party data sets into the model helps train it. Parcel and tax assessor data, such as building age, lot size, home value, zoning, and land use, can inform the model of the possible presence of lead. Housing data such as age, square footage, number of beds and baths, and sale price are also key data sources that help the model learn. Demographic data like census tract or block and age of population are helpful for model training as well. If you need assistance selecting the proper input data for your machine learning model, contact Trinnex. Their dedicated team specializes in using predictive modeling to combat unknown service line materials using the leadCAST Predict™ model.
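Combining these sources usually comes down to joining them onto each service record by a shared key. The sketch below assumes services link to parcels by a `parcel_id` and parcels link to demographic data by a `census_tract`. All keys, field names, and values are hypothetical, and missing matches are left as `None` rather than guessed.

```python
def build_feature_rows(services, parcels, tracts):
    """Attach parcel-level and tract-level attributes to each service record."""
    rows = []
    for svc in services:
        parcel = parcels.get(svc["parcel_id"], {})  # house-specific data
        tract = tracts.get(parcel.get("census_tract"), {})  # aggregated data
        rows.append({
            "service_id": svc["service_id"],
            "year_built": parcel.get("year_built"),
            "lot_size_sqft": parcel.get("lot_size_sqft"),
            "home_value": parcel.get("home_value"),
            "land_use": parcel.get("land_use"),
            "median_resident_age": tract.get("median_resident_age"),
        })
    return rows


# Hypothetical inputs keyed by parcel ID and census tract.
services = [
    {"service_id": "S1", "parcel_id": "P1"},
    {"service_id": "S2", "parcel_id": "P9"},  # no matching parcel record
]
parcels = {
    "P1": {"year_built": 1938, "lot_size_sqft": 5200, "home_value": 210000,
           "land_use": "residential", "census_tract": "T1"},
}
tracts = {"T1": {"median_resident_age": 41}}

rows = build_feature_rows(services, parcels, tracts)
for row in rows:
    print(row)
```

Keeping unmatched fields as `None` makes gaps in the third-party data visible, so they can be handled deliberately before training instead of silently skewing the model.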

Machine learning is not an entirely new process; it is an option to automate a utility’s efforts to find lead more efficiently, informed by historical records and physical verifications. It helps fill in the gaps to make records as useful as possible, improves accuracy, and reduces expensive, disruptive, and time-consuming field verifications. By combining machine learning with human intelligence, communities can improve their service line inventories and focus efforts on improving public health in a more efficient and cost-effective way.

Joanna Cummings, PE
LCR Compliance Coordinator, CDM Smith
Mark Zito, CFM, GISP
Product Manager, Trinnex
Trinnex leadCAST Predict machine learning model
Check out how Trinnex's machine learning model, leadCAST Predict™, can learn from your field data and make smart guesses for unknowns that are 95% accurate.
