BiometricsSA: Statistics for the Agricultural, Aquatic, Biological, Environmental, Food and Wine Sciences.

Current Postgraduate Research

QTL Analysis

Student: Paul Eckermann
Supervisors: Assoc. Prof. Ari Verbyla, Dr Brian Cullis, Prof. Robin Thompson
Commenced: March 1999

This research is concerned with finding the best methods for mapping quantitative trait loci (QTL), particularly in doubled haploid field trials of crops such as wheat and barley. The impact that spatial variation and other sources of variation have on the detection of QTLs is being investigated, and the best way to allow for this variation during the analysis is being determined. The extension of these methods to multi-environment trials (METs), where genotype by environment interactions may occur, is also considered.

This research is very important to the grains industry in Australia. A lot of money has been spent on programs such as the National Wheat Molecular Marker Program (NWMMP) and the National Barley Molecular Marker Program (NBMMP) to locate the genes that control various traits of interest, so it is important that the methods used to find these genes are optimal. If genetic information can be successfully included in breeding programs, the time to release of varieties can be reduced dramatically, with obvious benefits to the industry.


Whole Chromosome methods for mapping QTLs

Student: Scott Foster
Supervisors: Assoc. Prof. Ari Verbyla, Dr Wayne Pitchford, Prof. Cindy Bottema
Commenced: February 2003

This research is concerned with the estimation of effects and locations of quantitative trait loci (QTL) in half-sib populations. Initially, interest is focused on the manner in which QTL analyses are currently performed in these populations and on ways to improve those analyses immediately. Topics include the calculation of transmission probabilities (given that the common parents' haplotypes are unknown) and the inclusion of markers as cofactors: how to identify them and how to use them in the models.

The final goal is to develop novel techniques that map whole chromosomes at a time rather than relying on single markers or interval mapping. If this can be achieved, then the accuracy, flexibility, believability and probably speed of analyses of QTL data should all increase. This is very important in a practical sense, as a huge amount of money is being spent on molecular marker experiments (both in plants and animals). It would be unfortunate if this effort went to waste because of poor or misused statistical methods.


Spatial statistics for discrete data, with applications in weed management

Student: Kathy Haskard
Supervisors: Assoc. Prof. Ari Verbyla, Dr Brian Cullis
Commenced: June 2001

Statistical methodology is well developed for the estimation of the distribution of spatial field data such as yield and soil pH. These methods typically assume the data are continuous and normally distributed. Diggle, Tawn and Moyeed (1998) showed that these methods are unreliable for data that are discrete and therefore not normally distributed.

This project will examine the accuracy of conventional spatial techniques when applied to discrete data, and of more recently developed techniques such as those outlined in Diggle, Tawn and Moyeed (1998), with the aim of developing accurate and practical statistical methods for spatially distributed discrete data. We will focus particularly on a spatial data set of seed counts from soil cores across a field, containing many small counts.
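For orientation, a model of the kind discussed in Diggle, Tawn and Moyeed (1998) can be written for seed counts roughly as follows. This is a sketch under stated assumptions, not the model adopted in this project: the Poisson distribution, the log link, the intercept-only mean and all symbols (S, x_i, beta, sigma^2, rho, phi) are illustrative choices.

```latex
% Assumed notation: Y_i is the seed count in soil core i, taken at location x_i.
% S is a latent, zero-mean stationary Gaussian process carrying the spatial dependence.
\begin{aligned}
S(\cdot) &\sim \text{Gaussian process with mean } 0,\ \text{variance } \sigma^2,\
            \text{correlation function } \rho(u;\phi), \\
Y_i \mid S &\sim \mathrm{Poisson}(\mu_i), \quad \text{independently for } i = 1,\dots,n, \\
\log \mu_i &= \beta + S(x_i).
\end{aligned}
```

The counts are conditionally independent given S, so all spatial correlation enters through the latent process rather than through a normality assumption on the counts themselves.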

In addition to being discrete and highly non-normal, weed seedbank data are often highly spatially heterogeneous, which can result in unreliable estimates using conventional sampling techniques. Optimal long-term management of weeds requires accurate determination and monitoring of seedbanks at a field or farm level. Thus it is also intended to compare various sampling strategies to determine the most statistically efficient and practical strategy.

References

  • Diggle, P. J., Tawn, J. A. and Moyeed, R. A. (1998). Model-based geostatistics. Applied Statistics, 47, 299-350.


Analysis of Stability data in the Pharmaceutical Industry

Student: Andreas Kiermeier
Supervisors: Assoc. Prof. Ari Verbyla, Dr Richard Jarrett
Commenced: February 1998

As part of any Investigational New Drug application (IND), New Drug Application (NDA), New Dosage Form application (NDF) or Abbreviated New Drug Application (ANDA), as well as in post-approval monitoring, the stability of the drug product has to be investigated. In these stability trials the drug product is placed in storage under controlled test conditions, and the potency of the product is analyzed at given time intervals. The aim of the stability trial is to determine the length of time, called the shelf-life, for which the drug product meets specifications. This research concentrates on the analysis of data that arise from such stability studies and the determination of the shelf-life.
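To make the idea of a shelf-life concrete, the sketch below uses one common convention for a single batch: fit a straight line to potency over time and take the shelf-life as the time at which a one-sided lower confidence bound on the mean potency crosses the lower specification limit. This is an illustrative assumption and not necessarily the definition developed in this research. The function names, the data, the specification limit and the critical value `t_crit` (supplied by the caller from t tables) are all hypothetical.

```python
import math

def fit_line(times, potency):
    """Ordinary least squares fit of potency on time for a single batch."""
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(potency) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, potency))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    sse = sum((y - intercept - slope * t) ** 2 for t, y in zip(times, potency))
    s = math.sqrt(sse / (n - 2))  # residual standard deviation
    return intercept, slope, s, t_bar, sxx, n

def lower_bound(t, fit, t_crit):
    """One-sided lower confidence bound on the mean potency at time t."""
    intercept, slope, s, t_bar, sxx, n = fit
    se = s * math.sqrt(1.0 / n + (t - t_bar) ** 2 / sxx)
    return intercept + slope * t - t_crit * se

def shelf_life(times, potency, spec, t_crit, t_max=120.0, tol=1e-9):
    """Time at which the lower bound first falls below spec (capped at t_max)."""
    fit = fit_line(times, potency)
    if lower_bound(0.0, fit, t_crit) < spec:
        return 0.0
    if lower_bound(t_max, fit, t_crit) >= spec:
        return t_max
    lo, hi = 0.0, t_max
    # The bound is concave in t, so there is a single crossing in (0, t_max].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lower_bound(mid, fit, t_crit) >= spec:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical data: months in storage and potency as % of label claim.
months = [0.0, 3.0, 6.0, 9.0, 12.0, 18.0]
noisy_potency = [100.2, 98.3, 97.1, 95.4, 94.2, 90.8]
T_CRIT = 2.132  # assumed: one-sided 95% point of the t distribution with 4 df
estimate = shelf_life(months, noisy_potency, spec=90.0, t_crit=T_CRIT)
```

Because the bound sits below the fitted line whenever there is residual variation, the estimate is more conservative than simply reading off where the fitted line meets the specification limit.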

The Food and Drug Administration (FDA) has published new draft guidelines entitled "Guidance for Industry: Stability testing of Drug Substances and Drug Products" (June 1998). These guidelines attempt to give guidance on the analysis of stability data, but no proper mathematical definition for shelf-life is given. Furthermore, the guidance given for the analysis of several batches is not very clear in its mathematical and statistical requirements.

This research has now established a precise definition of shelf-life and has produced some good methodology for analyzing stability data from a single batch. Currently, the focus is on analyzing and combining data from multiple batches in order to determine a shelf-life that will hold for all future production batches.


Extensibility in Wheat

Student: Patrick Lim
Supervisors: Assoc. Prof. Ari Verbyla, Dr Alison Smith (NSW Agriculture)
Commenced: March 2003

Plant breeders measure numerous traits in selecting the best possible lines for release into the Australian grains industry. These traits cover agronomic performance, quality, disease, and abiotic and biotic stresses.

There has been a lot of statistical methodology developed for the design and analysis of yield trials. Spatial methods are now widely accepted and used throughout all of Australia's public breeding programmes. Most commonly yield is one of the traits used in determining a breeding line's suitability for promotion and subsequently release to Australian farmers.

The flow-on of these statistical methods to measurements of quality has been limited, partly because of practical laboratory constraints and timeliness. Generally, selection for quality is based on bulked samples or the analysis of single replicates, and reported results consist of 'raw' averages. Extensibility (dough stretchability) is an important quality characteristic in the wheat industry, and there are currently several different methods for measuring 'extensibility'.

This project will lead to an improved understanding of the relationship between the tests for extensibility. This is vital both for selection purposes and for the identification of quantitative trait loci. Understanding may be improved with the use of a multivariate approach to analysis. Such an analysis would provide an estimate of the genetic variance matrix, which comprises the genetic variance for each test and the genetic covariance (and hence correlation) between each pair of tests. The standard multivariate approach involves an unstructured (US) model for the variance matrix. It may be useful to use a Factor Analytic (FA) model (see Smith et al., 2001) instead. This would provide a range of information, including:

  (a) The FA (or US) model would establish whether certain tests have a strong genetic correlation. If this were consistent across a range of data sets, it would indicate that some tests are not necessary. This would be a key finding, since the cost and difficulty of some of these tests are high.
  (b) The FA model may identify the underlying "latent variable" that is being measured by the tests. This would help with an understanding of the tests and may form the basis of a selection index.
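As a rough illustration of the variance structure described above, the sketch below builds the genetic variance matrix implied by a one-factor FA model (loadings times loadings transposed, plus a diagonal matrix of specific variances) and converts it to genetic correlations. The single factor, the function names and all numerical values are assumptions made for illustration; they are not estimates from any data set.

```python
import math

def fa_variance_matrix(loadings, specific):
    """Variance matrix G = l l' + diag(psi) implied by a one-factor FA model."""
    p = len(loadings)
    return [
        [loadings[i] * loadings[j] + (specific[i] if i == j else 0.0)
         for j in range(p)]
        for i in range(p)
    ]

def to_correlation(G):
    """Convert a variance matrix to the corresponding correlation matrix."""
    p = len(G)
    sd = [math.sqrt(G[i][i]) for i in range(p)]
    return [[G[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]

# Hypothetical loadings and specific variances for three extensibility tests.
loadings = [2.0, 1.5, 0.5]
specific = [1.0, 0.75, 2.0]

G = fa_variance_matrix(loadings, specific)  # genetic variances and covariances
R = to_correlation(G)                       # genetic correlations between tests

for row in R:
    print(["%.3f" % r for r in row])
```

With p tests, this one-factor form uses 2p parameters, compared with p(p+1)/2 for the unstructured model, which is part of its appeal when many tests are compared. Tests with large loadings relative to their specific variances are strongly correlated with the common factor, which is the sense in which the factor can be read as a latent variable.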

References

  • Smith, A. B., Cullis, B. R. and Thompson, R. (2001). Analyzing variety by environment data using multiplicative mixed models and adjustments for spatial field trend. Biometrics 57, 1138-1147.

Specialized generalized linear models for Ordinal Data

Student: Debra Partington
Supervisors: Assoc. Prof. Ari Verbyla, Dr Arthur Gilmour, Dr Raul Ponzoni
Commenced: September 2002

The project 'Early selection of Merino rams for improving lifetime wool production and quality' (referred to hereafter as Project DAS 101) was conducted by SARDI between 1989 and 1997 (Ponzoni et al. 1995). It represents a comprehensive genetic study of South Australian Merino sheep, arising from the need to match breeding and genetics research with the requirements of industry. Records on quantitative and qualitative traits associated with wool production and quality were collected at ages that reflected industry practices.

This study will (a) review models for ordinal data, with special emphasis on the inclusion of random effects; (b) estimate phenotypic and genetic parameters for the subjectively assessed characters using standard mixed model methodology; and (c) develop and use novel methods, based on the ordinal nature of the subjective scores, to estimate the same parameters. Standard mixed models and the novel approach will be used to make recommendations regarding the analysis and reporting of such information by performance recording services. The genetic association between the subjectively assessed characters and lifetime wool production and quality will be estimated. Improvements to breeding values for structural soundness traits will be recommended, enabling better control and genetic improvement of such traits.

References

  • McCullagh, P. (1980). Regression models for ordinal data (with discussion). J. Roy. Statist. Soc. B 42, 109-142.
  • Thompson, R. and Baker, R. J. (1981). Composite link functions in generalised linear models. Applied Statistics 30, 125-131.
  • Hedeker, D. and Gibbons, R. D. (1994). A random-effects ordinal regression model for multilevel analysis. Biometrics 50, 933-944.
  • Henderson, C. R. (1984). Applications of Linear Models in Animal Breeding. University of Guelph.


Dual Modelling of Mean and Dispersion

Student: Julian Taylor
Supervisors: Assoc. Prof. Ari Verbyla, Dr William Venables
Commenced: March 1998

Modelling of dispersion has a long history in statistics. For example, Park (1966) and Harvey (1976) describe applications to economic data in the presence of variance heterogeneity. More recently, Verbyla (1993) and Smyth (1989) generalised these ideas to Gaussian and non-Gaussian error distributions respectively, under maximum likelihood. Restricted maximum likelihood (REML; see Patterson and Thompson, 1971; Harville, 1977; Verbyla, 1990) for the Gaussian error distribution is discussed by Verbyla (1993), and Smyth and Verbyla (1996) extend these results to double generalised linear models.
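In its simplest Gaussian form, the kind of joint model referred to above can be sketched as follows. The notation (x_i, z_i, beta, lambda) is assumed here for illustration, and the log link for the variance is one common choice rather than the only one.

```latex
% Assumed notation: x_i and z_i are covariate vectors for the mean and the
% dispersion of observation i; beta and lambda are the corresponding parameters.
\begin{aligned}
y_i &\sim N(\mu_i,\ \sigma_i^2), \quad \text{independently for } i = 1,\dots,n, \\
\mu_i &= x_i^{\top}\beta, \\
\log \sigma_i^2 &= z_i^{\top}\lambda.
\end{aligned}
```

Setting z_i to a constant recovers the usual homogeneous-variance linear model, so variance heterogeneity can be examined through the remaining elements of lambda.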

Many extensions to the dispersion model are possible. In particular, the inclusion of complex structures in the dispersion is a growing area of research (see Nelder and Lee, 1996; Rigby and Stasinopoulos, 1996). Currently in this research, inference under maximum likelihood and restricted maximum likelihood with Gaussian and non-Gaussian random effects is being studied. Under maximum likelihood it can be shown that, in at least one case, the likelihood for the response is exact. Under approximate theory, robust estimates are obtained using an extension of the Laplace approximation and other techniques.

References

  • Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica, 44, 460-465.
  • Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and related problems. Journal of the American Statistical Association, 72, 320-340.
  • Park, R. E. (1966). Estimation with heteroscedastic error terms. Econometrica, 34, 888.
  • Patterson, H. D. and Thompson, R. (1971). Recovery of interblock information when block sizes are unequal. Biometrika, 58, 545-554.
  • Rigby, R. A. and Stasinopoulos, M. D. (1996). "Mean and dispersion additive models" in Statistical Theory and Computational Aspects of Smoothing. Physica-Verlag, 215-230.
  • Smyth, G. K. (1989). Generalised linear models with varying dispersion. Journal of the Royal Statistical Society Series B, 51, 47-60.
  • Smyth, G. K. and Verbyla, A. P. (1996). A conditional likelihood approach to REML in generalized linear models. Journal of the Royal Statistical Society Series B, 58, 565-572.
  • Verbyla, A. P. (1990). A conditional derivation of residual maximum likelihood. Australian Journal of Statistics 32, 227-230.
  • Verbyla, A. P. (1993). Modelling variance heterogeneity: residual maximum likelihood and diagnostics. Journal of the Royal Statistical Society Series B, 55, 493-508.


Variance Estimation in Mixed Models

Student: Emma Wilkinson
Supervisors: Assoc. Prof. Ari Verbyla, Dr Brian Cullis, Dr Robin Thompson
Commenced: February 1999

Various iterative schemes have been proposed for Residual Maximum Likelihood (REML) estimation in linear mixed effects models. These include the Average Information (AI) algorithm (Gilmour et al., 1995), the Expectation-Maximisation (EM) algorithm (Dempster et al., 1977) and the Parameter Expanded EM (PX-EM) algorithm (Liu et al., 1998). GENSTAT, through the REML directive, uses either the AI or the Fisher Scoring (FS) algorithm, which can result in unreliable convergence sequences for more complex models. Foulley and van Dyk (2000) compared the convergence rates of the EM, ECME (Liu and Rubin, 1994) and PX-EM algorithms for the analysis of three data sets. Their results indicate that the PX-EM algorithm generally converges faster than the EM and ECME algorithms, though convergence still requires a surprisingly large number of iterations. Unfortunately, their study did not include a direct comparison with derivative-based schemes such as the FS or AI algorithms. This project involves a comparison of derivative-based schemes and EM-type schemes.
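For reference, the quantities that the derivative-based schemes iterate on can be sketched as follows. The notation (X, Z, G, R, kappa, P) is assumed here and follows common mixed-model usage. The average information expression is stated for variance parameters that enter V linearly, as in Gilmour et al. (1995).

```latex
% Assumed notation: kappa is the vector of variance parameters,
% V_i = \partial V / \partial \kappa_i, and m indexes iterations.
\begin{aligned}
y &= X\beta + Zu + e, \qquad u \sim N\bigl(0, G(\kappa)\bigr), \quad
     e \sim N\bigl(0, R(\kappa)\bigr), \quad V = ZGZ^{\top} + R, \\
\ell_R(\kappa) &= -\tfrac{1}{2}\Bigl\{ \log|V| + \log\bigl|X^{\top}V^{-1}X\bigr|
     + y^{\top}Py \Bigr\} + \text{const}, \\
P &= V^{-1} - V^{-1}X\bigl(X^{\top}V^{-1}X\bigr)^{-1}X^{\top}V^{-1}, \\
\frac{\partial \ell_R}{\partial \kappa_i} &=
     -\tfrac{1}{2}\Bigl\{ \operatorname{tr}(PV_i) - y^{\top}PV_iPy \Bigr\}, \\
\kappa^{(m+1)} &= \kappa^{(m)} + \bigl[\mathcal{I}(\kappa^{(m)})\bigr]^{-1}
     \left.\frac{\partial \ell_R}{\partial \kappa}\right|_{\kappa^{(m)}}, \\
\mathcal{I}_{A}(\kappa_i, \kappa_j) &= \tfrac{1}{2}\, y^{\top}PV_iPV_jPy .
\end{aligned}
```

Fisher Scoring uses the expected information in place of the average information in the update step. EM-type schemes avoid the information matrix altogether, which is one reason their per-iteration cost and convergence behaviour differ from those of the FS and AI algorithms.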

References

  • Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1-38.
  • Foulley, J.-L. and van Dyk, D. A. (2000). The PX-EM algorithm for fast stable fitting of Henderson's mixed model. Genetics, Selection and Evolution 32, 143-163.
  • Gilmour, A. R., Thompson, R. and Cullis, B. R. (1995). Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics 51, 1440-1450.
  • Liu, C. and Rubin, D. B. (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81, 633-648.
  • Liu, C., Rubin, D. B. and Wu, Y. N. (1998). Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika 85, 755-770.
