Administrative Data
|
Administrative data, or administrative by-product data, is data that is produced in the everyday
workings of organisations. Examples include counts of births, deaths, marriages
and divorces; hospital admissions; car sales; median house prices; and information
relating to case management, such as Centrelink statistics (the number of age
pension or JobSeeker recipients).
|
Base Year
|
The base year is the reference point against which the other values in an
index or time series are compared.
If we had an index where 2015 was
the base year, the value for 2015 would be 100, allowing simple comparability
to the base year in the form of a percentage. The Australian Bureau of
Statistics (ABS) Consumer Price Index has a base year which changes
periodically, with the last change occurring in September 2012, from 1989–90
to 2011–12.
|
Break in Series
|
A break in series occurs
when a change is made to a data collection and the new data is no longer
comparable to the previous data. However, a break may not necessarily
jeopardise the reliability of a time series.
For example, the ABS Overseas Arrivals and Departures dataset has had
a number of breaks in the series, resulting in data that is not comparable across the
years. In 2017 a review led to changes in methodology, in where
data was sourced and in the way it was processed. These changes created a break
in the series. The ABS re-released a 10-year time series based on the new
methodology.
|
Census (population)
|
Whilst a sample survey only
counts a proportion of the population, the purpose of a population census
is to count everyone within the assessed population. Population censuses are
conducted periodically by many countries to assess the demographic makeup of the
population residing in that country. Australia’s official Census (of Population and Housing) is run every five years and provides a snapshot of the
entire population.
S Surbhi, ‘Difference between Census and sampling’, Key Differences, updated 19 August 2017,
accessed 29 May 2021.
|
Confidence Intervals
|
Confidence intervals are a statistical measure of uncertainty, expressed as a
range of likely or possible outcomes. This uncertainty derives from the fact
that statistics about large populations, for practical reasons, are usually
drawn from samples of that population. A confidence interval provides a range
in which the ‘true’ population mean is expected to sit, based on the sample.
The mean age of a sample, for example, is only an estimate of the average
age of the population it was drawn from. The average age of a sample of people might be 35, and based on
that sample statistic and the variance of ages within the sample, a
confidence interval around the mean might estimate that there is a 95 per cent
probability that the true average age of the population is between 30 and 40.
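The calculation behind a confidence interval can be sketched in a few lines. This is a minimal illustration only: the ages are hypothetical, the function name is an assumption, and it uses the normal approximation (a multiplier of 1.96 for 95 per cent), which is rough for small samples.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Return (lower, upper) bounds of an approximate 95% interval for the mean."""
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(sample size)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * standard_error, mean + z * standard_error

# Hypothetical sample of ages with a mean of 35
ages = [22, 25, 27, 31, 35, 35, 38, 41, 44, 52]
lower, upper = mean_confidence_interval(ages)
print(f"Sample mean: {statistics.mean(ages)}, interval: {lower:.1f} to {upper:.1f}")
```

A larger sample, or one with less variation in ages, would produce a narrower interval.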
|
Confidentiality
|
Confidentiality refers to protecting the privacy of information
collected from individuals and organisations. This means that when
information is made available, it needs to be done in a way that is unlikely
to allow individuals or organisations to be identified. Maintaining
confidentiality is both a legal and ethical obligation, and a failure to
maintain confidentiality is called a confidentiality breach, or disclosure.
The ABS applies confidentiality to its Causes of Death dataset to protect
individuals. As a
result, some totals will not equal the sum of their components. Where figures
have been rounded, discrepancies may occur between totals and sums of the
component items.
|
Constant and Current Prices
|
Constant and current prices are
also known as real and nominal prices. Constant prices are prices that
have been adjusted for inflation, so that they are expressed in the dollars of
a single reference period (often the present day). Current prices make no adjustment for inflation,
reflecting the value of the price at the time it was measured.
For example, given a price
valued at $130 in 2019 and an inflation rate of 3%, the constant price would
be $130 × 1.03, or $133.90, in 2020 dollars. In this sense, $133.90 in 2020 is equal to
$130 in 2019. However, the current price would be the given price in 2019, which
is $130.
Using constant prices allows a
comparison between two points in time, for example a basket of food in 1980 and
2020. Another example is measuring the change in real wages over a specified
period of time.
Constant prices can be used to
assess changes in values over time in real terms, which can be tied to
funding allocations.
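The adjustment in the $130 example can be sketched as follows. The function name is an assumption for illustration, and the sketch assumes inflation is supplied as a list of annual rates that compound from year to year.

```python
def to_constant_price(current_price, annual_inflation_rates):
    """Re-express a price in later-year dollars by compounding annual inflation."""
    price = current_price
    for rate in annual_inflation_rates:
        price *= 1 + rate
    return price

# $130 in 2019, with 3% inflation between 2019 and 2020
price_2019 = 130.00
price_in_2020_dollars = to_constant_price(price_2019, [0.03])
print(f"${price_in_2020_dollars:.2f}")
```

Passing several rates, for example `[0.03, 0.02]`, carries the price forward across more than one year.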
|
Correlation and Causation
|
Correlation is a measure of the relationship between two variables,
describing how one variable moves with another. Correlation ranges from -1,
where two variables are perfectly correlated with a negative relationship, to
1, where two variables are perfectly correlated with a positive relationship.
Correlation does not mean that change in one variable causes the other to
change.
When one variable causes another
to change, this is called causation. There is no statistical measure
that establishes causation; instead, it is established using experiments and inference.
It is important not to mistake correlation for causation.
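As an illustration of the -1 to 1 range, the sketch below computes the commonly used Pearson correlation coefficient by hand. The data are hypothetical and the function name is an assumption; a strong result here would still say nothing about causation.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together, relative to their own spread
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

# Hypothetical data
x = [1, 2, 3, 4, 5]
rises_with_x = [2, 4, 6, 8, 10]
falls_with_x = [10, 8, 6, 4, 2]

positive = pearson_correlation(x, rises_with_x)   # perfect positive relationship
negative = pearson_correlation(x, falls_with_x)   # perfect negative relationship
print(positive, negative)
```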
|
Cross-Sectional Data
|
Cross-sectional data is the result of a data collection carried out at a
single point in time. It differs from time-series
data, which observes changes over time.
|
Distributions
|
A statistical distribution
is a graph of the possible values of a random variable with the associated
rate at which they are likely to occur. The most commonly used distribution
is the ‘normal’ (or ‘bell curve’) distribution,
which features higher occurrence closer to the mean (or average).
We can view the distributions of
data samples to obtain important information about that data. For example,
assessing the distribution of household income in Australia would highlight
the number of Australians living on lower incomes, or the size of the middle
class.
|
Equivalised
|
The ‘equivalisation’
process is typically used in relation to measuring household incomes, to
enable comparisons of the economic well-being of different households.
If you simply looked at the
total income of different households, you would assume that the household
with the highest total income would be the most well off. However, this
doesn’t take into account the fact that larger households require higher
levels of income to maintain the same standard of living as smaller
households. The equivalisation process divides disposable income by an
‘equivalence scale’ that counts the first adult as 1 and adds 0.5 for a second
adult and 0.3 for a child less
than 15 years old. These scales take into account the economies of scale
associated with sharing dwellings, and the fact that children require fewer
resources than adults. A couple household with disposable income of $1,500
per week, for example, would have equivalised disposable income of $1,000
($1,500 divided by 1.5).
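The couple example can be reproduced with a short sketch. The function name is hypothetical, and the sketch assumes that every adult after the first adds 0.5 and every child under 15 adds 0.3 to the scale; the text above only states the weights for a second adult and a child.

```python
def equivalised_income(disposable_income, adults, children_under_15=0):
    """Divide household disposable income by an equivalence scale."""
    # Assumed scale: 1.0 for the first adult, 0.5 per extra adult, 0.3 per child
    scale = 1.0 + 0.5 * (adults - 1) + 0.3 * children_under_15
    return disposable_income / scale

# Couple household with $1,500 per week in disposable income
couple = equivalised_income(1500, adults=2)
# The same income shared with two children under 15
couple_with_children = equivalised_income(1500, adults=2, children_under_15=2)
print(couple, round(couple_with_children, 2))
```

The second household has the same total income but a lower equivalised income, because the income supports more people.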
|
Estimates
|
Statistical inference can be
used to apply estimates to populations. For example, if we have a
sample and hold the assumption that it is representative of the population,
we can use that sample to estimate characteristics of the population.
|
Index
|
Indexes, or indices, are used to compare numbers as they develop
over time. It is usual to fix the
first observation to a base value of 100 and then express all following
observations relative to this base, so that relative changes over time can be compared.
For example, if we had an index
measuring the price of cars, the price of the bundle of cars in the initial
year of our measure would be set to 100. If the price of cars rises 10% over the
year, the next year’s index would be 110.
Frequently used indexes include
the Consumer Price Index (CPI) and Wage Price Index (WPI).
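The car price example can be sketched as below. The prices are hypothetical and the function name is an assumption; the sketch simply rescales each observation so that the base observation equals 100.

```python
def to_index(values, base_position=0, base_value=100.0):
    """Rescale a series so the observation at base_position equals base_value."""
    base = values[base_position]
    return [round(value / base * base_value, 1) for value in values]

# Hypothetical price of a bundle of cars over three years
car_prices = [30000, 33000, 34650]
car_price_index = to_index(car_prices)
print(car_price_index)
```

The second value reflects the 10% rise over the first year. Changing `base_position` rebases the whole series to a different year.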
|
Longitudinal Data
|
Longitudinal data, also known as panel data, is data collected over time
which tracks the same sample of participants at different points in time.
Longitudinal data can be used to assess how different factors may influence
opinions over time, amongst other things.
One such example is the Household, Income and Labour Dynamics in Australia
(HILDA) Survey,
which commenced in 2001. HILDA is a household-based panel study that
collects information about economic and personal well-being, labour market
dynamics and family life. The HILDA Survey allows researchers to track the
employment and income outcomes of participants and whether they progress to
better outcomes.
|
Mean
|
The mean, also referred
to as the average, is found by adding all data points and dividing by the
number of data points.
For example: (10 + 10 + 20 + 40 + 70) / 5 = 30
|
Median
|
The median is the middle
number, found by ordering all data points and picking out the one in the
middle (or if there are two middle numbers, taking the mean of those two
numbers).
For example, in the ordered set
1, 2, 3, 4, 5, 6, 14
the median is 4.
The median is
useful for measuring the midpoint of, for example, income distributions. This
is because the average of such measures may be influenced by outliers. In the
example above, 14 can be identified as an outlier, as it is not consistent
with the rest of the data; it pulls the mean up to 5 but leaves the median at 4.
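The effect of an outlier on the mean and the median can be checked with Python's standard `statistics` module. This is a small sketch using the example values above.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 14]
mean_value = statistics.mean(data)      # pulled upwards by the outlier 14
median_value = statistics.median(data)  # the middle value of the ordered data

# Removing the outlier leaves an even number of values, so the median
# becomes the mean of the two middle numbers (3 and 4)
without_outlier = data[:-1]
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print(mean_value, median_value, mean_without, median_without)
```

Dropping the outlier moves the mean from 5 to 3.5, while the median only moves from 4 to 3.5.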
|
Metadata
|
Metadata is data that provides information about other data. For
example, data relating to a document on the internet is metadata, such as the
type of file it is, the number of people who downloaded it, and when it
was accessed.
|
Mode
|
The mode of
a set of data values is the value that appears most often.
For example, in the set
2, 5, 6, 2, 2, 2, 5, 6, 2
the mode is 2, which appears five times.
The mode can
be used in initial assessments of data to obtain information. If we were
assessing the ABS’s household income data, the mode allows us to identify which
income group is most common in Australia. In this case, the most common group
is $3,000-3,499 in 2017-18.
|
Moving averages
|
A moving average uses a
set number of data points to create the mean of the data points, moving over
time as new data is added and older data removed. There are various types of
moving averages, including simple, weighted, and exponential. The most
frequently used moving averages are four quarter and 12 month moving
averages, which are used to smooth volatile data, such as regional labour
force survey estimates.
For example, a moving average
which considers 5 data points has the initial set of {3, 3, 4, 6, 8}. As
such, the initial moving average is (3 + 3 + 4 + 6 + 8) / 5 = 4.8. However,
suppose we get a new data point 10. Given our moving average only considers 5
data points, we add 10 to the set and drop the oldest value, keeping only the
last 5 data points. As
such, our new set is {3, 4, 6, 8, 10}, and our new moving average is (3 + 4 +
6 + 8 + 10) / 5 = 6.2.
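The worked example corresponds to a simple (unweighted) moving average, which can be sketched as follows. The function name is an assumption for illustration.

```python
from collections import deque

def simple_moving_averages(values, window):
    """Return the mean of each run of `window` consecutive values."""
    recent = deque(maxlen=window)  # the oldest value drops out automatically
    averages = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            averages.append(sum(recent) / window)
    return averages

# The initial set {3, 3, 4, 6, 8} followed by the new data point 10
series = [3, 3, 4, 6, 8, 10]
averages = simple_moving_averages(series, window=5)
print(averages)
```

Weighted and exponential moving averages follow the same idea but give more weight to recent data points.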
|
Original, seasonally adjusted and trend estimates
|
Original, seasonally adjusted
and trend estimates are different forms of time series estimates.
Original estimates best capture actual movements in the data.
Seasonally adjusted
estimates take the originals and remove
seasonal patterns (including holiday periods such as Easter and Christmas) to
create more consistent data which is less affected by predictable seasonal swings.
Trend estimates further smooth seasonally adjusted estimates to create
a view of the data that reflects a long-term trend. Trend data is best used
to form a view of the underlying direction of the data and how the future may
play out, but it does not capture month-to-month
movements in the data. The COVID-19 pandemic has resulted in the suspension
of the trend series in many ABS data collections, due to the importance of
month-to-month changes.
|
Outlier
|
An outlier is a data
value that is very different from most of the other values in a data set. Due
to this difference the outlier may have significant impact on statistics
drawn from the dataset. Outliers in datasets often require further
examination to tell whether they are meaningful or not.
|
Percentage
|
A percentage (%) expresses a number
as a fraction of one hundred; it compares one value in relation to another.
If we held an election, and
party A received 53 out of 90 votes, with party B receiving the other 37
votes, then party A would have received about 59% of the vote. We can calculate
party A’s vote share by taking their votes divided by the total votes,
multiplied by 100.
In this case, party A’s vote
share is 53/90 * 100 = 58.9%, or 59% when rounded.
|
Projections
|
A projection uses trends
and other inputs to project how a set of data may change in the future.
For example, if a trend shows an
increase in the purchase of a product by 50%, a projection can show how it
would impact on our economy should the trend persist. The Reserve Bank of
Australia and the Treasury produce a number of projections assessing the
Australian economy.
|
Quantitative research
|
Quantitative research is
the process of collecting and analysing numerical data. Quantitative data collection methods are much more
structured than qualitative data collection methods. Quantitative data
collection methods include various forms of surveys—online surveys, paper surveys, face-to-face interviews, telephone
interviews, longitudinal studies, and online polls.
|
Qualitative research
|
Qualitative research aims to gather an in-depth understanding of human
behaviour via first-hand observation, face-to-face
interviews, questionnaires, focus groups, participant-observation etc. The
data are generally nonnumerical.
Some other examples of qualitative data include:
· The reasons why people like eating at restaurants.
· The problems people face when moving house.
|
Range
|
The range represents the
actual spread of data. It is the difference between the highest and lowest
observed values.
For example, the lowest value
of the following data set [2,3,7,9,1,4] is 1
and its highest value is 9, so its range is 9−1=8. As
with calculation of the median, it is helpful to order data observations to
find the highest and lowest values.
Subject coach, Definition of Range (Statistics), 5 February 2018, accessed
30 April 2021
|
Rate
|
The rate simply refers to
the frequency of the occurrence of an event.
For example, if an event is
calculated to occur once every 100 opportunities, the rate would be 1 in 100.
Rates are often used to represent statistics regarding mortality. If we were
to assess deaths due to coronary heart disease in Australia, these can be
expressed as a rate of deaths per 100,000 population. In Australia, men die
from coronary heart disease at a rate of 119 deaths per 100,000 population
and women at a rate of 33 deaths per 100,000 population.
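A rate per 100,000 population can be sketched as below. The death count and population are hypothetical figures chosen only to show the arithmetic, and the function name is an assumption.

```python
def rate_per(events, population, per=100_000):
    """Express a count of events as a rate per `per` people."""
    return events * per / population

# Hypothetical: 238 deaths in a population of 200,000
deaths = 238
population = 200_000
death_rate = rate_per(deaths, population)
print(f"{death_rate:.0f} deaths per 100,000 population")
```

Expressing counts as rates allows populations of different sizes to be compared directly.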
|
Ratio
|
A ratio is used to
compare two quantities, referencing one against the other.
For example, if a poll of 30
people is taken and 20 vote for party A and 10 vote for party B, then party A
has more votes than party B by a ratio of 2:1. Similarly, if the poll
included 70 people, with 40 voting for party A and 30 for party B, party A
now leads party B by a ratio of 4:3.
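Ratios such as 2:1 and 4:3 come from dividing both quantities by their greatest common divisor. A minimal sketch, with a hypothetical function name:

```python
import math

def simplify_ratio(a, b):
    """Reduce the ratio a:b to its simplest whole-number form."""
    divisor = math.gcd(a, b)
    return a // divisor, b // divisor

first_poll = simplify_ratio(20, 10)   # 20 votes for party A, 10 for party B
second_poll = simplify_ratio(40, 30)  # 40 votes for party A, 30 for party B
print(f"{first_poll[0]}:{first_poll[1]} and {second_poll[0]}:{second_poll[1]}")
```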
|
Sample Size
|
When sample surveys are
collected, the sample size indicates the number of participants within
the survey. As the sample is a subsection of the population, the greater the
sample size the more representative it will be of the population, assuming
that bias is removed through random sampling.
|
Sample Survey
|
A sample is a subset of a
population, often randomly selected for the purpose of studying the
characteristics of the entire population. When it comes to data collection, a
sample survey is one of the most widely used methods of
collecting data relating to a population of interest.
See: Census
|
Significance
|
Statistical significance is used to assess whether a result reflects the relationship
assessed within a study or is due to random chance. When a result is
statistically significant, it means that the relationship observed is unlikely
to be due to chance alone. When undertaking a study, statistical significance
provides important validation for any results that may have been obtained.
When tests are unable to obtain this validation, the results are statistically non-significant.
|
Standard Deviation
|
Standard
deviation is the measure of spread most commonly used in statistical practice
when the mean is the measure of centre.
Thus it
measures spread about the mean. Because of its close links with the mean,
standard deviation can be seriously affected if the mean is a poor measure of
location. The standard deviation is also influenced by outliers; it is a good
indicator of the presence of outliers because it is so sensitive to them.
Therefore, the standard deviation is most useful for symmetric distributions
with no outliers (normal distributions).
In an asymmetrical
distribution the two sides will not be mirror images of each other.
The key features of a normal
distribution as seen in the example above:
· symmetrical shape
· mode, median and mean are the same and are together in the centre of the curve
· there can only be one mode (i.e. there is only one value which is most frequently observed)
· most of the data are clustered around the centre, while the more extreme
values on either side of the centre become rarer as the distance from the
centre increases. About 68% of values lie within one standard deviation (σ)
of the mean; about 95% of the values lie within two standard deviations; and
about 99.7% are within three standard deviations. This is known as the
empirical rule or the 3-sigma rule.
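The percentages in the empirical rule can be checked against a standard normal distribution (mean 0, standard deviation 1) using `NormalDist` from Python's standard `statistics` module. This is a quick sketch; the helper name is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of values lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"Within {k} standard deviation(s): {share:.1%}")
```

The same shares apply to any normal distribution, whatever its mean and standard deviation.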
|
Standard Error and Relative Standard Error
|
Standard Error (SE) measures
how much an estimate drawn from a sample, such as a sample mean, is likely to
vary from the true population value. It is found by taking the standard
deviation divided by the square root of the sample size. As the size of the
sample increases, the standard error will decrease. An increase in sample
size leads to a more accurate estimate, and this is reflected in the
measurement of the standard error. The standard error can be used to obtain
confidence in data, with roughly a 95% chance that the true value of a measure lies
within two standard errors of a survey estimate.
For example, a sample with a
standard deviation of 1.5 and a sample size of 15 will have a standard error of 0.39.
However, if the standard deviation were 1.5 and the sample size were 60, the
standard error would be 0.19.
The Relative Standard Error
(RSE) provides an expression of the standard error in a simpler form,
using percentages. As the ABS
notes, an estimate with an RSE of 25% or greater is
considered unreliable. The relative standard error is taken by dividing
the standard error of a measurement by the measurement itself, and then
expressing the number as a percentage by multiplying it by 100. Where the
standard error is expressed in the units of the measurement, the relative
standard error makes it easier to compare the reliability of different measures.
For example, if the standard
error of a measure was 0.02, and the measurement itself was 10, the relative
standard error as a proportion would be 0.02 / 10 = 0.002. To represent the RSE, this is multiplied by
100 to obtain a percentage. In this case, the RSE would be 0.002 * 100 =
0.2%.
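Both calculations can be sketched in a few lines, using the figures from the examples above. The function names are assumptions for illustration.

```python
import math

def standard_error(standard_deviation, sample_size):
    """Standard deviation divided by the square root of the sample size."""
    return standard_deviation / math.sqrt(sample_size)

def relative_standard_error(se, estimate):
    """Standard error as a percentage of the estimate itself."""
    return se / estimate * 100

se_small_sample = standard_error(1.5, 15)
se_large_sample = standard_error(1.5, 60)
rse = relative_standard_error(0.02, 10)

print(round(se_small_sample, 2), round(se_large_sample, 2), f"{rse:.1f}%")
```

Quadrupling the sample size from 15 to 60 halves the standard error, because the sample size enters through its square root.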
|
Time Series
|
A time series is a set of
data which can show changes in a variable over time.
For example, the unemployment
rate can be viewed as a time series, highlighting fluctuations over time. The
ABS presents time series data in three formats: original, seasonally adjusted
and trend estimates.
|
Variable
|
A variable
is a characteristic that can be measured and that can take different values. For example, household income
data provides measures of a number of variables, such as gross household
income per week, equivalised disposable household income per week, or labour
force status. Each of these variables provides a different measure.
|
Variance
|
The variance of a set
of data is a measure of how spread out the data is from the mean value. When
measuring variance, a higher number indicates the data is more spread out.
The standard deviation is the square root of the variance.
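To illustrate, the sketch below compares two hypothetical data sets with the same mean but different spread. It uses the population variance from Python's standard `statistics` module, which averages the squared distances from the mean.

```python
import statistics

# Two hypothetical data sets, both with a mean of 10
tightly_grouped = [9, 10, 10, 11]
widely_spread = [2, 6, 14, 18]

variance_tight = statistics.pvariance(tightly_grouped)
variance_wide = statistics.pvariance(widely_spread)

print(variance_tight, variance_wide)
```

When the data is a sample rather than a whole population, `statistics.variance` divides by one less than the number of data points instead.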
|