Predicting Biodegradability with Regression Trees (Napovedovanje biorazgradljivosti z regresijskimi drevesi)

Abstract

The biodegradability of a chemical compound must be considered when assessing how safe its use is for the environment. Because of the sheer number of chemicals, it is practically impossible to determine biodegradability experimentally for all of them, or even for a significant fraction. A possible solution to this problem is quantitative structure-activity relationship (QSAR) analysis: a representative group of chemicals is tested experimentally, and a model is then built that satisfactorily describes both the tested chemicals and untested ones. The model can be built with classical linear regression methods or with machine learning methods, typically regression tree building methods. This work compares these two types of models. For several data sets, models are built with the Cubist and RETIS regression tree building systems. All models are cross-validated, and the best ones are inspected by domain experts. For small sets of compounds with similar structure, models built with linear regression are usually more accurate than models built with regression trees, although the latter sometimes reach comparable accuracy and are easy to understand. For large sets of structurally diverse compounds, regression trees yield more accurate models than linear regression.
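
The comparison procedure described in the abstract (fit both kinds of model, then cross-validate) can be sketched as follows. This is only a minimal illustration under stated assumptions: the data are synthetic, with a single hypothetical descriptor and a step-like response; the tree learner is a small hand-written one with constant leaves, not Cubist or RETIS; and all function names are invented for this sketch. It does not reproduce the thesis experiments or results.

```python
import random
from statistics import mean


def fit_linear(xs, ys):
    """Ordinary least squares for one descriptor: y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def fit_tree(xs, ys, depth=3, min_leaf=5):
    """Tiny regression tree: recursive squared-error splits, constant leaves."""
    leaf_value = mean(ys)
    if depth == 0 or len(xs) < 2 * min_leaf:
        return lambda x: leaf_value
    pairs = sorted(zip(xs, ys))
    best = None  # (sum of squared errors, split index)
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical descriptor values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, i)
    if best is None:
        return lambda x: leaf_value
    i = best[1]
    threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    lower = fit_tree([x for x, _ in pairs[:i]], [y for _, y in pairs[:i]],
                     depth - 1, min_leaf)
    upper = fit_tree([x for x, _ in pairs[i:]], [y for _, y in pairs[i:]],
                     depth - 1, min_leaf)
    return lambda x: lower(x) if x <= threshold else upper(x)


def cv_rmse(fit, xs, ys, k=10, seed=0):
    """k-fold cross-validated root mean squared error of a fitting function."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    sq_errors = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_errors += [(model(xs[i]) - ys[i]) ** 2 for i in fold]
    return mean(sq_errors) ** 0.5


# Synthetic, assumed data: one descriptor and a noisy step-like response.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [(1.0 if x < 3 else 4.0 if x < 7 else 2.0) + rng.gauss(0, 0.3) for x in xs]

rmse_linear = cv_rmse(fit_linear, xs, ys)
rmse_tree = cv_rmse(fit_tree, xs, ys)
print(f"linear regression CV RMSE: {rmse_linear:.3f}")
print(f"regression tree CV RMSE:   {rmse_tree:.3f}")
```

Both models are scored on the same folds, so the two error values are directly comparable. On a response that changes in steps, as assumed here, a straight line fits poorly and the tree has the lower error; on a response that is close to linear, the ranking would be expected to reverse.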

Publication
BSc thesis (diplomsko delo)