Significant or random? A critical review of sociolinguistic generalisations based on large corpora*

Vaclav Brezina and Miriam Meyerhoff
Lancaster University / Victoria University of Wellington

International Journal of Corpus Linguistics 19:1 (2014), 1–28. doi 10.1075/ijcl.19.1.01bre. ISSN 1384-6655 / E-ISSN 1569-9811. © John Benjamins Publishing Company

* We would like to thank the three anonymous reviewers for their comments and suggestions on an earlier version of this paper; we would also like to thank members of the audience at the first Asia-Pacific Corpus Linguistics conference.

This article offers a critical review of a methodology often employed in corpus-based sociolinguistic studies which make use of aggregate data. This methodology relies on a general comparison of frequencies of a target linguistic variable in socially defined sub-corpora. The main issue with this procedure lies in the fact that it emphasises inter-group differences and ignores within-group variation. The methodology thus often yields false positive results (with highly significant log-likelihood scores). This article presents evidence which shows that sociolinguistic studies based on aggregate data are in principle unreliable. Using BNC 32, a one-million-word corpus of informal speech, it demonstrates that random (and therefore sociolinguistically irrelevant) speaker groupings can often yield statistically significant results. The article offers suggestions for an alternative methodology (using the Mann-Whitney U test), which takes into account within-group differences and therefore produces more meaningful results.

Keywords: sociolinguistics, methodology, statistics, log-likelihood, aggregation

1. Introduction

In essence, corpus linguistics is a quantitative paradigm of linguistic research based on analysis of large amounts of naturally occurring language (collected as spoken and written language corpora). One of the main tenets of corpus linguistics is the idea that (semi)automatic analysis of corpora can help us discover typical patterns of language use (cf. McEnery & Hardie 2012, Biber et al. 1998, Sinclair 1991). The cost that a corpus linguist pays for quantitative robustness of the research lies in the fact that he or she is able to observe many of the patterns only from a distance, with limited knowledge of the larger context (cf. Stubbs 2001, Widdowson 2000). Nevertheless, corpus linguistics has brought important insights into the study of lexis (all major dictionaries nowadays are based on large language corpora) and grammar (e.g. Biber et al. 1999) as well as into the analysis of collocational patterns (e.g. Biber 2009, Hunston & Francis 2000).

In addition, the wider availability of corpora annotated for the basic speaker characteristics (gender, age, socio-economic status, etc.) has opened a new avenue of research into social patterns in language use. For example, one of the early corpus-based sociolinguistic studies (Rayson et al. 1997) uses the conversational subcorpus of the British National Corpus (BNC) to investigate lexical preferences of speakers from different social groups. Rayson et al. (1997) use a procedure similar to the keyword method (e.g. Scott 1997, 2001), which identifies lexical units that occur significantly more frequently in one part of the corpus in comparison with another. A different perspective on language variation is offered by Nevalainen & Raumolin-Brunberg (2003), who explore the possibility of sociolinguistic investigations with historical corpora. Using the Corpus of Early English Correspondence, Nevalainen & Raumolin-Brunberg (2003) examine different aspects of language change
in the context of social variation. In contrast to Rayson et al. (1997), who look at lexical variation, Nevalainen & Raumolin-Brunberg (2003) focus on variation inherent in the grammar of English. They show that with some frequent variables, sociolinguistic inquiries are meaningful even with a relatively small corpus (cf. Nevalainen 1999).

More recently, the topic of the use of corpora in sociolinguistic research has been explored by Baker (2010). In his book, Baker (2010) provides a detailed discussion of corpus methods and their use in sociolinguistic investigations. Reviewing a range of studies that use corpora for sociolinguistic purposes, he shows how different corpus tools can be applied in a number of areas of research. At the same time, he points out that we need to be cautious when interpreting the frequency counts in sociolinguistic data. Baker (2010:56) suggests that "when grouping together a large number of speakers we can overlook differences within groups, which may have a skewing effect on our results".

This paper further explores the problem highlighted by Baker (2010). We look at the methodology used in a number of corpus-based sociolinguistic studies and discuss the limits of useful generalisation. A typical example of a large-scale corpus-based sociolinguistic study is Xiao & Tao's (2007) research on amplifiers in British English based on the 100-million-word BNC. The authors claim that in contrast to previous studies, which were based on considerably smaller corpora than the BNC or were not
based on corpora at all, the advantage of their approach lies in the potential to bring "more reliable results" (Xiao & Tao 2007:248). They identify a number of forms (amplifiers) which are preferred by speakers with different social characteristics (age, gender, socio-economic status, education, etc.). However, rather than looking at the variation in the speech of individual speakers/authors (which would be impracticable with over 5,000 BNC authors and speakers), they compare the occurrence of a large number of linguistic variables in broadly defined subcorpora. Like most corpus-based sociolinguistic studies, they therefore rely in their analyses on aggregate data.

It is important to realise that aggregating data is a normal procedure in every corpus design. When building large language corpora, we take speech and writing samples produced by a large number of speakers/authors and combine them. In large corpora, although usually available, the information about the social characteristics of individual speakers is difficult to retrieve and use for detailed sociolinguistic analyses which would take into consideration individual variation. Moreover, the samples from some speakers can be relatively short and therefore cannot be meaningfully used as individual data points. Most of the corpus-based sociolinguistic studies to date have therefore relied on general comparisons of two or more sub-corpora, such as the speech of all men in the corpus compared with the speech of all women in the corpus, or the speech of all younger speakers compared with the speech of all older speakers. These comparisons will be referred to as 'aggregate data methodology'.

The following invented example shows the underlying principles of the aggregate data methodology and opens the discussion about some of its shortcomings. In this hypothetical situation, the occurrence of a linguistic variable x is examined in the speech of five female (F1–F5) and five male (M1–M5) speakers.
The distribution of the variable x in 1,000-word speech samples is shown in Table 1.
Table 1. Distribution of linguistic variable x in individual speakers

| Individual speakers | F1 | F2 | F3 | F4 | F5 | M1 | M2 | M3 | M4 | M5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Freq. of ling. variable x | 10 | 10 | 100 | 10 | 10 | 20 | 20 | 20 | 20 | 20 |
| Sample size | 1,000 | 1,000 | 1,000 | 1,000 | 1,000 | 1,000 | 1,000 | 1,000 | 1,000 | 1,000 |
Just by visually inspecting Table 1 we can see that there is a clear tendency for the male speakers to use the variable x more often, in fact twice as often as the female speakers. This generalisation is true with the exception of one female speaker (F3) who strongly prefers the variable x. When we, however, aggregate the data, we get a very different picture as can be seen from the contingency table below (Table 2). Here, the group of female speakers appear to use the variable x more often than the male speakers.
Table 2. Contingency table based on aggregate data

| | Occurrences of variable x | Corpus size (tokens) |
|---|---|---|
| Female speakers | 140 | 5,000 |
| Male speakers | 100 | 5,000 |
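The contrast between the speaker-level view of Table 1 and the aggregate view of Table 2 can be sketched in a few lines of Python. This is only an illustration of the invented example: the variable names are ours, and the median is used here simply as one possible summary of the "typical" speaker in each group.

```python
from statistics import median

# Frequencies of variable x per 1,000-word sample, copied from Table 1.
female = [10, 10, 100, 10, 10]   # speakers F1-F5
male = [20, 20, 20, 20, 20]      # speakers M1-M5

# Aggregate view: pool each group into one sub-corpus (as in Table 2).
female_total = sum(female)       # occurrences in 5,000 female tokens
male_total = sum(male)           # occurrences in 5,000 male tokens

# Speaker-level view: the typical (median) speaker in each group.
female_median = median(female)
male_median = median(male)

print("aggregate totals (F, M):", female_total, male_total)
print("per-speaker medians (F, M):", female_median, male_median)
```

The pooled totals favour the female group, while the medians favour the male group; the reversal is driven entirely by the single outlying speaker F3.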
When statistical tests (log-likelihood test, chi-square test or any other test used for the analysis of categorical data) are applied to compare male and female data in Table 2, the incorrect observation that female speakers prefer the variable x is only confirmed. The following are the test statistics and p values of the chi-square test and log-likelihood: χ² (df = 1) = 6.67, p<.01; LL-score = 6.7, p<.01. In both cases the null hypothesis is safely rejected. This invented example thus points to the hidden shortcomings of the aggregate data procedure; however, we may ask to what extent these shortcomings have a bearing on the results of corpus-based sociolinguistic studies with real corpus data.

Table 3 reviews nine recent sociolinguistic studies based on language corpora. It provides basic information about the studies (linguistic and social variables examined), paying particular attention to methodological aspects such as the size of the corpus, the number of speakers in the corpus and the statistics used. Despite the fact that these studies represent a heterogeneous group in terms of the data used and the individual methodological decisions, (some of) their quantitative findings rely on the aggregate data methodology. We want to note here that in this article, we do not question the overall contribution made by the studies in Table 3. Our aim is to point out some of the problems with the aggregate data methodology that these studies use. As we explained above, this is a standard methodology following from the very corpus design.

It is also important to note that in addition to the use of aggregate data methodology, some of the studies from Table 3 employ a statistical measure to establish whether observed differences between the individual sub-corpora in question are due to chance or are statistically significant.
Table 3. Recent studies using aggregate corpus data in sociolinguistic analyses

| Study | Linguistic variable(s) | Social variable(s) | Corpus (size in words) | No. speakers | Statistics used |
|---|---|---|---|---|---|
| Torgersen et al. (2011) | pragmatic markers (you know, you know what I mean, you know what I'm saying, among others) | gender, ethnicity, geographical location | LIC-2 (865k); COLT-2 (123k), working class teenagers | 100; 19 | LL-score, Spread (i.e. no. of users per 100 speakers) |
| Murphy (2010) | hedges, taboo language | age (20s, 40s, 70s, 80s), gender | CAG Limerick, Cork (90k), consists of MAC (47k) & FAC (49k), random casual conversations | 20 | Keywords (LL-score?), frequencies per 1M words compared |
| Murphy (2009) | fuck (i.e. fuck, fucking, fucks, fucked) | age (20s, 40s, 70s, 80s), gender | CAG Limerick, Cork (90k), consists of MAC (47k) & FAC (49k), random casual conversations | 20 | Frequencies per 1M words compared |
| Barbieri (2008) | slang words, markers of stance, evaluative adjectives | age (15–25, 30–60) | American conversation corpus-sample (400k) | 139 (85, 54) | Key word analysis, LL-score, frequencies per 1M words compared |
| Xiao & Tao (2007) | amplifiers | gender, age, education level, social class, audience gender, audience age | BNC (100M) | 5,500 (?) | LL-score, frequencies per 1M, cross tabulation |
| Anderwald (2005, 2002) | negative concord | region | BNC-sample (5M?) | 1,281 | Chi-square |
| McEnery & Xiao (2004) | fuck and derivatives | gender, age, social class, education | BNC-spoken (8M) | (?) | LL-score |
| Schmid (2003) | a large number of gender-related variables | gender | BNC-spoken (8M) | (?) | difference coefficient formula, hypergeometrical approximation of the binomial distribution |
| Macaulay (2002) | you know | age, gender, social class | Ayr interviews (120k); Glasgow same sex dyadic conversations (125k) | Ayr (12); Glasgow (?) | None; frequencies per 1,000 words compared |

The statistic (most often the log-likelihood score) is used both when looking at a single linguistic variable and when comparing two sub-corpora using the keyword procedure (Scott 1997, 2001). In essence, the latter is only a multiple iteration of the former. This becomes clear when we realise that the keyword procedure relies on a pairwise comparison of frequencies in wordlists based on two different (sub)corpora and identification of those items whose frequencies of occurrence differ significantly between the (sub)corpora. The general principle, however, is the same as with comparisons focusing on a single linguistic variable (or a small number of linguistic variables). Table 4 shows sociolinguistic findings of the studies reviewed in Table 3 related to the use of six selected linguistic variables (fuck (i.e. fuck, fucking, fucks and fucked), lovely, you know, sort of, really and negative concord). As can be seen
Table 4. Results of recent studies using aggregate corpus data in sociolinguistic analyses
Linguistic Previous research Social variable Corpora used variable
McEnery and Xiao (2004)
Murphy (2010) Murphy (2009)
Schmid (2003) Barbieri (2008)
lovely you know
Schmid (2003) Torgersen et al. (2011) Macaulay (2002)
Schmid (2003) Schmid (2003) Barbieri (2008)
Xiao and Tao (2007)
Schmid (2003) Barbieri (2008) (2005, 2002)
age gender, age
CAGLimerick, Corck(90k) CAGLimerick, Corck(90k)
gender, age, class Ayr (120k)
gender, age, class BNC (100M)
gender age region
BNC-spoken (8M) American conversation corpus-sample (400k) BNC-sample (5M?)
gender YES (M+) age YES (Y+) class YES (WC+) YES (Y+) gender YES (M+) age YES (Y+) YES (M+) YES (Y+) YES (F+) NO effect gender YES (F+) age YES (adults+) class YES (depending on the position in clause) YES (F+) YES (M+) YES (Y+) gender YES (F+) age NO class YES (MC+) YES (F+) YES (Y+) YES (S+)
Only social variables comparable with the present research are listed (i.e. speaker's gender, age, socioeconomic status and region). Preferred by: M+ (males), F+ (females), Y+ (younger speakers), O+ (older speakers), WC+ (workingclass speakers), MC+ (middle-class speakers), S+ (South).
from the last column (Effect found?) a large majority (8 out of 9) of the studies based on aggregate data found an effect for some of the basic speaker characteristics such as gender, age, social class or region. However, a question that needs to be asked in relation to these results is to what extent the findings represent a reliable reflection of the social reality and to what extent they suffer from the shortcomings of the aggregate data methodology.

Ideally, one would replicate each of the studies in Table 3, using the corpora and the methods of aggregation in the original analyses, and compare this with a detailed analysis which pays attention to individual differences between speakers. This is, however, not feasible, either because the corpus is not in the public domain or, in the case of studies based on the BNC, because of the very large number of individual speakers and the extreme variance in the amount of speech produced by them. This degree of inter-speaker variance makes it impossible to normalise the data in the necessary ways and explore it employing the methods we set out to use.

This study, therefore, offers a controlled investigation of the six selected linguistic variables (fuck, lovely, you know, sort of, really and negative concord) based on BNC 32. BNC 32 is a socially-balanced one-million-word corpus of British speech extracted by the lead author from the BNC-demographic. Our aim is to demonstrate that random (and therefore sociolinguistically irrelevant) speaker groupings can often yield statistically significant results if general comparisons based on aggregate data are used. In doing so we will provide a motivation for using metrics that are able to account for both individual and social factors in the variation observed.

2. Data and methodology

The analyses reported in this article are based on BNC 32,¹ a socially-balanced one-million-word corpus of British informal conversation, which was extracted from the demographic part of the BNC. It represents the speech of 32 British English speakers, 16 women and 16 men (see Appendix). The speakers were selected according to the following criteria:

i. Each speaker contributes between 13 and 64 thousand running words;
ii. The speakers form a balanced sample in terms of gender, age and socioeconomic status;
iii. The speakers come from various parts of the UK.

¹ An extended version of BNC 32, BNC 64, is available from http://corpora.lancs.ac.uk/bnc64/ (accessed December 2013). The search environment allows complex search patterns and provides automatic statistical evaluation of the data.
The following table summarises the main features of the corpus:
Table 5. Structure of BNC 32

| Tokens | No. of speakers | Speaker gender | Speaker age | Speaker SES | Genre | Discourse mode | Variety | Period |
|---|---|---|---|---|---|---|---|---|
| 1.04 mil. | 32 | 16 male, 16 female | A: 10, B: 13, C: 9 | AB: 7, C1: 9, C2: 9, DE: 5, UU: 2 | informal conversation | highly interactive | UK English | early 1990s |

Age: A: 15–34; B: 35–54; C: 55+. Socio-economic status: AB: managerial, administrative, professional; C1: junior management, supervisory, professional; C2: skilled manual; DE: semi- or unskilled; UU: unknown
The major advantage of BNC 32 is the fact that (unlike the majority of commonly used corpora) it enables us to search for language forms in the speech of individual speakers. This enables us to compare aggregate data methodology with analyses which pay attention to inter-speaker differences. MonoConc Pro (Barlow 2002) was used to search for the following eight linguistic variables: fuck (fuck, fucking, fucks, fucked), lovely, you know, sort of, really, negative concord, the and some (see Table 6).
Table 6. Linguistic variables

| Linguistic variable | AF in BNC 32 | NF | Linguistic description | Level |
|---|---|---|---|---|
| fuck | 535 | 5.51 | taboo word | Lexis |
| lovely | 488 | 5.03 | adjective | Lexis |
| you know | 3,353 | 34.56 | discourse marker | Discourse |
| sort of | 1,168 | 12.04 | hedge | Discourse |
| really | 1,766 | 18.20 | amplifier | Discourse |
| negative concord | 138 | 1.42 | negative concord | Grammar |
| the | 29,251 | 301.48 | definite article | Dummy variable |
| some | 1,810 | 20.70 | determiner, pronoun | Dummy variable |

AF: absolute (raw) frequency; NF: normalised frequency per 10,000 words
The first six linguistic variables were chosen according to the following criteria:
i. They appear in the studies reviewed in Table 3 and are reported to be used differently by speakers of various social groups (based on gender, age, social class and region);
ii. In BNC 32 they occur with frequencies which vary from 1.4 to 34.6 in 10,000 words;
iii. They represent a variety of lexical, discourse and grammatical forms.
The last two variables (the definite article the and the form some, which functions as a determiner or a pronoun) were chosen as "dummy variables" for the purposes of comparison. The definite article is the second most frequent word in the whole corpus (after the personal pronoun I) and, unlike I, does not seem to index any social-group membership. Similarly, some does not seem to have any social indexical properties. In addition, the form occurs with a normalised frequency of 20.7 per 10,000 words, which is more comparable with the other six linguistic variables than the definite article; this means that we can check whether distributional patterns are perhaps an artefact of the overall frequency of a lexeme in the large corpus. In short, these dummy variables allow us to test the distribution of forms that are not hypothesised to have any social indexicality.

The following steps were employed in order to evaluate two different corpus-comparison methodologies, namely aggregate data methodology and comparisons taking into account inter-speaker differences:

i. Replication of previous research with BNC 32 based on aggregate data;
ii. Analysis of random speaker groups with aggregate data methodology;
iii. Replication of previous research with BNC 32 taking into account inter-speaker differences;
iv. Analysis of random speaker groups by paying attention to inter-speaker differences.
2.1 Replication of previous research with BNC 32: Aggregate data methodology

The first analysis was designed to follow a methodology similar to that of the majority of studies from Table 3. It was therefore based on general comparisons of individual sub-corpora of BNC 32 (aggregate data). We compared the occurrences of the eight linguistic variables (see Table 6) in BNC 32 sub-corpora defined by gender, age, socio-economic status and region of the speakers. The following pairwise comparisons were carried out:
Table 7. Structure of BNC 32

| Social variable | Gender | Age | Socio-economic status | Region |
|---|---|---|---|---|
| Pairwise comparisons | male | younger speakers (15–34) | working class | South |
| | female | older speakers (35–82) | middle class | Midlands-North |
For the purposes of the present research, the two age groups were defined as follows: younger speakers, age range 15–34, mean age 26.8; older speakers, age range 35–82, mean age 53.5. Working class speakers held skilled manual (BNC code: C2) or semi- or unskilled (BNC code: DE) jobs, while middle class speakers worked in managerial, administrative, professional and supervisory positions (BNC codes: AB and C1). The regional split (South vs. Midlands-North) was chosen to follow the results reported in Anderwald's (2005) study on negative concord, which is also replicated here.

For general sub-corpora comparison a log-likelihood (LL) score was calculated in order to establish whether two sub-corpora under comparison were samples from the same population or differed significantly from each other (cf. Rayson et al. 2004). We have chosen the LL-score as a statistical tool commonly used for comparing two (sub)corpora and for the identification of keywords (see above). It is important to note that the choice of a particular statistical procedure is secondary to the fact that by aggregating data we lose track of the individual speaker differences (as exemplified in Section 1). The statistical procedure (LL-score in this case) only reinforces possibly incorrect observations. Another statistical procedure commonly used in corpus-based sociolinguistic studies with aggregate data is the chi-square statistic. Although the equations for calculation of chi-square and log-likelihood differ, the basic principle of comparison between expected and observed values in two (sub)corpora is similar. The criticisms of aggregate data methodology with the log-likelihood statistic explored in this article can therefore be extended also to situations in which chi-square is used. The LL-score was calculated in the following way:
i. Aggregate data were entered in the contingency table displayed in Table 8;
ii. Expected values (values which we would expect if the sub-corpora were samples from the same population) were calculated according to the following formulae:

$$E_1 = \frac{c\,(a+b)}{c+d} \qquad E_2 = \frac{d\,(a+b)}{c+d}$$

iii. The following formula was used to calculate the test statistic:

$$LL = 2 \times \left( a \ln\frac{a}{E_1} + b \ln\frac{b}{E_2} \right)$$

iv. The test statistic was compared with the cut-off points provided in Table 9. For 2×2 contingency tables (comparison of 2 sub-corpora), a test statistic of 3.84 or higher is significant at the level of p<.05.
Table 8. Comparison of two corpora: contingency table

| | Linguistic variable (observed value) | Other words in the corpus | Size of the corpus (words) |
|---|---|---|---|
| Subcorpus 1 | a | c-a | c |
| Subcorpus 2 | b | d-b | d |
Table 9. LL-test statistic cut-off values

| LL-score | 3.84 | 6.63 | 10.83 |
|---|---|---|---|
| p value | <.05 | <.01 | <.001 |
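Steps i to iv can be sketched in Python as follows. The function names are ours, and treating a zero observed count as contributing nothing to the sum is an assumption of this sketch rather than something stated in the text. The arguments a, b, c and d correspond to the cells of Table 8, and the cut-off values are those of Table 9.

```python
import math

def log_likelihood(a, b, c, d):
    """LL-score for a, b occurrences in sub-corpora of c, d words (Table 8)."""
    # Step ii: expected values if both sub-corpora came from one population.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    # Step iii: 2 * (a * ln(a / E1) + b * ln(b / E2)).
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # assumption: a zero count adds nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total

def significance(ll):
    """Step iv: map an LL-score onto the cut-off points of Table 9."""
    for cutoff, label in ((10.83, "p<.001"), (6.63, "p<.01"), (3.84, "p<.05")):
        if ll >= cutoff:
            return label
    return "n.s."

# The invented example from Table 2: 140 vs. 100 occurrences in 5,000 tokens each.
score = log_likelihood(140, 100, 5000, 5000)
print(round(score, 2), significance(score))
```

Applied to the invented data of Table 2, this reproduces the LL-score of about 6.7 (p<.01) reported in Section 1.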
2.2 Aggregate data with random speaker groups

The second analysis applied the methodology described in Section 2.1 to random speaker groupings. Random Integer Set Generator (www.random.org) was used to randomly assign the 32 speakers from BNC 32 into 500 group pairs (see Figure 1). This sampling procedure was repeated three times, giving a final number of 1,500 random assignments. The occurrence of the eight linguistic variables was established for each random group of speakers, and each random pair was compared by calculating the log-likelihood score. Finally, the percentage of statistically significant differences (LL-score>3.84, p<.05) between the pairs of random groups was calculated for each of the linguistic variables (see Figure 1). The percentage of statistically significant differences between random samples was used as a proxy measure of the (un)reliability of the aggregate data methodology with the LL statistic (cf. Johnson 2009). The larger the percentage of significant random comparisons, the larger the number of false positive results using this methodology.
[Flowchart: the 32 speakers of BNC 32 are split into two random groups of 16 speakers in each of 500 random groupings; each pair of groups is compared (LL-scores 1 to 500) and the percentage of LL-scores > 3.84 (p < .05) is recorded. The sampling and comparison procedure is repeated three times, giving 1,500 random assignments and pairwise comparisons.]

Figure 1. BNC 32: Random grouping and comparison
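The random-grouping procedure of Section 2.2 can be sketched as a small simulation. This is only an illustration under stated assumptions: Python's pseudo-random generator stands in for the Random Integer Set Generator used in the study, the function names are ours, and the per-speaker counts below are hypothetical values generated for the sketch, not BNC 32 data.

```python
import math
import random

def log_likelihood(a, b, c, d):
    """LL-score for a, b occurrences in sub-corpora of c, d words."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    return 2 * sum(o * math.log(o / e) for o, e in ((a, e1), (b, e2)) if o > 0)

def share_significant(speakers, n_groupings=500, cutoff=3.84, seed=0):
    """Share of random half/half groupings whose aggregate LL-score reaches cutoff.

    speakers: list of (occurrences, tokens) pairs, one pair per speaker.
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_groupings):
        shuffled = speakers[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        group1, group2 = shuffled[:half], shuffled[half:]
        # Aggregate each random group into one sub-corpus.
        a = sum(hits for hits, _ in group1)
        c = sum(tokens for _, tokens in group1)
        b = sum(hits for hits, _ in group2)
        d = sum(tokens for _, tokens in group2)
        if log_likelihood(a, b, c, d) >= cutoff:
            significant += 1
    return significant / n_groupings

# Hypothetical speakers (NOT BNC 32 data): sample sizes within the range
# given in Section 2 and a skewed per-speaker rate of use.
data_rng = random.Random(1)
speakers = []
for _ in range(32):
    tokens = data_rng.randint(13_000, 64_000)
    rate = data_rng.expovariate(500)  # mean rate of 0.002 per token
    speakers.append((int(tokens * rate), tokens))

print("share of significant random groupings:", share_significant(speakers))
```

If the aggregate procedure were reliable, the printed share would stay close to the nominal 5 per cent; the more skewed the per-speaker rates, the further it tends to drift above that level.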
2.3 Replication of previous research with BNC 32: Individual speaker differences

The main issue with the procedure described in Section 2.1 (which was also followed for comparison of random speaker groupings described in Section 2.2) lies in the fact that it emphasises inter-group differences and ignores within-group variation (cf. Deutschmann 2003:130). As will be shown below, the methodology often yields false positive results with very low p values (cf. also Brezina 2013). In order to analyse meaningful social variation, we need to account for both social and individual variation in the data. BNC 32 was therefore searched for the eight linguistic variables in the speech of the individual speakers. Since the total size of each sample is different (varies between 12 and 57 thousand running words), the absolute frequencies of occurrence of the eight linguistic variables were normalised per 10,000 words to allow for comparison.

For the comparisons which take into account variation between individual speakers, the non-parametric Mann-Whitney U test was used. The non-parametric test was chosen because all eight linguistic variables occur with skewed distributions. Like all non-parametric tests, the Mann-Whitney U test takes into account the ranks instead of the exact values (Field 2009:542, Gibbons & Chakraborti 2003:268). No (or little) difference between the sums of ranks for each of the compared groups indicates that the samples come from the same population. This test is therefore a powerful tool for comparing within-group coherence as well as inter-group variation when the normality assumption in the data is violated.

The Mann-Whitney U test was calculated in the following way (Field 2009:542ff, Mann & Whitney 1947:51): all cases were ranked from the lowest to the highest (lowest=1) in the whole dataset, ignoring the group membership. Tied values were given the same rank (the mean of the ranks they occupy). The test statistic was calculated in the following way:

$$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$$

In the equation above, n1 and n2 are sample sizes of groups 1 and 2 and R1 is the sum of ranks for group 1 (group with higher ranks). For n1=16 and n2=16, a U statistic smaller than 75 is significant at the .05 level.
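As an illustration, the ranking procedure can be sketched in Python and applied to the invented data of Table 1. The function name is ours, and the sketch assumes the standard rank-sum form U = n1·n2 + n1(n1+1)/2 − R1, computed for both groups with the smaller value returned, which corresponds to taking R1 from the group with the higher ranks.

```python
def mann_whitney_u(group1, group2):
    """Return the smaller Mann-Whitney U for two samples (mean ranks for ties)."""
    pooled = sorted(group1 + group2)
    # Collect the rank positions (starting at 1) occupied by each distinct value.
    positions = {}
    for rank, value in enumerate(pooled, start=1):
        positions.setdefault(value, []).append(rank)
    # Tied values share the mean of the ranks they occupy.
    rank_of = {value: sum(ranks) / len(ranks) for value, ranks in positions.items()}

    n1, n2 = len(group1), len(group2)
    r1 = sum(rank_of[value] for value in group1)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 - u1  # U for the other group; the two always sum to n1 * n2
    return min(u1, u2)

# Invented data from Table 1 (frequency of variable x per 1,000 words).
female = [10, 10, 100, 10, 10]
male = [20, 20, 20, 20, 20]
print(mann_whitney_u(female, male))
```

Because the test works on ranks, the outlying speaker F3 counts as one high rank rather than as 100 occurrences, so it cannot dominate the comparison in the way it dominates the aggregate totals.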
2.4 Inter-speaker differences and random speaker groups

Finally, in the fourth analysis we used the same 1,500 random speaker groupings as discussed in Section 2.2 in order to ascertain whether more reliable results can be obtained if we take into account inter-speaker differences. Instead of calculating log-likelihood scores for general sub-corpora comparisons, a series of Mann-Whitney U tests were employed (see the description of the statistic in Section 2.3). Finally, the percentage of statistically significant differences between the pairs of random groups was calculated for each of the linguistic variables. Again, this percentage was used as a proxy measure of the reliability of the methodology.
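The fourth analysis can be sketched along the same lines as the aggregate simulation, this time comparing per-speaker normalised frequencies with the Mann-Whitney U test. The sketch makes several assumptions: Python's pseudo-random generator replaces the Random Integer Set Generator, the per-speaker rates are hypothetical, the function names are ours, and significance is decided with the cut-off quoted in Section 2.3 (U smaller than 75), which applies only to two groups of 16.

```python
import random

def mann_whitney_u(group1, group2):
    """Smaller Mann-Whitney U for two samples, using mean ranks for ties."""
    pooled = sorted(group1 + group2)
    positions = {}
    for rank, value in enumerate(pooled, start=1):
        positions.setdefault(value, []).append(rank)
    rank_of = {v: sum(r) / len(r) for v, r in positions.items()}
    n1, n2 = len(group1), len(group2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - sum(rank_of[v] for v in group1)
    return min(u1, n1 * n2 - u1)

def share_significant_u(rates, n_groupings=500, critical_u=75, seed=0):
    """Share of random 16/16 groupings with U below the critical value.

    rates: normalised frequency per speaker (e.g. per 10,000 words).
    critical_u=75 is the cut-off quoted for n1 = n2 = 16 only.
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_groupings):
        shuffled = rates[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        if mann_whitney_u(shuffled[:half], shuffled[half:]) < critical_u:
            significant += 1
    return significant / n_groupings

# Hypothetical skewed per-speaker rates (NOT BNC 32 data).
data_rng = random.Random(1)
rates = [data_rng.expovariate(0.1) for _ in range(32)]
print("share of significant random groupings:", share_significant_u(rates))
```

Each speaker enters this comparison as a single data point, so a few heavy users of a form cannot by themselves push a random grouping over the significance threshold.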
3. Analyses

In this section, we implement the different ways of analysing the data we introduced in the last section. We demonstrate the methodological problems with aggregating data. First, we offer a sociolinguistic analysis of the eight linguistic variables using the traditional LL-score. This is followed by an analysis of random speaker groupings intended to evaluate the reliability of the previous procedure. In the last two sub-sections, the two analyses based on aggregation (and the LL-score) are contrasted with an approach that takes into consideration inter-speaker differences.
3.1 Replication of previous research with BNC 32: Aggregate data
Let us start by looking at the results of the analyses based on aggregate data. Table 10 shows a series of comparisons between men and women (second column), younger and older speakers (third column), working class and middle class speakers (fourth column), and speakers from the South and speakers from the Midlands and the North (fifth column) in their use of the eight linguistic variables.
Table 10. Social variation in BNC 32 subcorpora: aggregate data and LL-score

| Linguistic variable | Gender | Age | Socioeconomic status | Region |
|---|---|---|---|---|
| fuck | 761.89*** (M+) | 558.40*** (Y+) | 646.00*** (WC+) | 347.23*** (S+) |
| lovely | 40.20*** (F+) | 1.30 | 20.03*** (MC+) | 1.06 |
| you know | 7.41** (F+) | 59.93*** (O+) | 1.55 | 74.28*** (M-N+) |
| sort of | 6.62* (M+) | 0.46 | 47.02*** (WC+) | 0.34 |
| really | 1.88 | 25.14*** (Y+) | 12.31*** (MC+) | 97.31*** (S+) |
| negative concord | 46.18*** (F+) | 0.02 | 2.73 | 17.74*** (M-N+) |
| the | 210.03*** (M+) | 19.24*** (O+) | 97.55*** (MC+) | 0.37 |
| some | 6.84** (F+) | 0.21 | 0.71 | 2.47 |

Preferred by: M+ (males), F+ (females), Y+ (younger speakers), O+ (older speakers), WC+ (working-class speakers), MC+ (middle-class speakers), S+ (South), M-N+ (Midlands & North). * p<.05; ** p<.01; *** p<.001; Other: n.s.
The results (obtained from the aggregate data) show a number of statistically significant differences. The expletive fuck is preferred by young speakers, male speakers, working-class speakers and speakers from the South. The adjective lovely occurs more frequently in female and middle-class speech. The discourse marker you know is preferred by female speakers, older speakers and speakers from the Midlands and the North. The classic hedge sort of seems to be a marker of gender (although there is a male preference, contrary to what we might expect from previous sociolinguistic studies of hedging, e.g. Holmes 1995b) and socio-economic status (working-class preference). The intensifier really occurs significantly more in the speech of younger speakers, middle-class speakers and speakers from the South (cf. Ito & Tagliamonte 2003, who in their study of British English found it was associated with middle-aged women). Negative concord appears to be typical of female speech (again, contrary to what previous sociolinguistic work would lead us to expect, e.g. Cheshire 1992, Smith 2001) and is typical of the speech of speakers from the Midlands and the North. Finally, and crucially for our argument, the distribution of the two dummy variables also shows a number of statistically significant results. The definite article appears to be significant for three independent variables: gender (male preference), age (older speakers' preference) and socioeconomic status
(middle-class speakers' preference). The determiner/pronoun some appears to be preferred by female speakers.

Overall, we can see that 62 per cent of the comparisons (20/32) based on aggregate data yield statistically significant results (at least at the .05 level). Seventeen (over 53 per cent) of the comparisons are significant at the .001 level. This is not surprising, since a large majority of the corpus-based sociolinguistic studies found in the literature report statistically significant differences between different speaker groups. On the other hand, the fact that the proportions of significant and highly significant results are so large may be seen as a warning signal. In particular, there does not seem to be any obvious reason for the dummy variables the and some to have socially motivated distributions. Doubt about the validity of such results is also raised by Schmid (2003:5), who observes in passing that with aggregate data and the chi-square statistic (similar to log-likelihood) "almost all observed differences [in his research] would have turned out significant at 99% level".2

We may therefore ask: do all these linguistic variables really index social group membership? Does the distribution of these linguistic variables (analysed with the aid of aggregate data) reflect social and linguistic reality? In order to answer these questions, in the next section we will explore the distribution of the eight linguistic variables in random speaker groupings (which do not have any bearing on social reality). If the same statistical procedure (based on aggregate data) yields many significant results (beyond chance), we can conclude that the results in Table 10 have only limited validity.

3.2 Aggregate data with random speaker groups

There are 300,540,195 possible ways (combinations) in which 32 speakers can be assigned to two groups of 16.
Only a very small number of these groupings are interesting from the point of view of sociolinguistics, i.e. only a few social groupings reflect salient social reality expressed in macro-variables such as gender, age and socio-economic status. The rest of the groupings represent random group assignments. Table 11 below shows the percentages of statistically significant results in three samples of 500 random groupings for the eight dependent variables. These percentages represent a measure of the (un)reliability of the aggregate data comparisons with the log-likelihood statistic. We can see that the percentages of statistically significant results are relatively stable across the three 500-group samples for all the linguistic variables. The individual proportions of false positive results vary from 48 per cent (lovely and some) to almost 99 per cent (fuck). Such high percentages of error represent a serious problem for the given methodology and may provide some indication as to why in the last section we found significant differences in the distribution patterns of some of these variables that run counter to established sociolinguistic wisdom. It should be noted that with the alpha level of the test set to .05 we would expect an error rate of approximately 5 per cent with random data.

2. However, Schmid (2003) does not question the fact that, despite his opting for a different statistical procedure (hypergeometrical approximation of the binomial distribution), over 80 per cent of the comparisons in his results based on aggregate gender data were statistically significant at least at the .05 level.
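The count of possible groupings and the 5 per cent benchmark can be checked with a short sketch. The code below recomputes the number of balanced groupings and implements one common two-term form of the log-likelihood statistic for a single item in two sub-corpora. The function names, the choice of this particular form of the statistic and the example frequencies are assumptions made for illustration; they are not details reported for the analyses in this article.

```python
import math

CRITICAL_05 = 3.84  # chi-square critical value, 1 degree of freedom, p < .05


def balanced_groupings(n_speakers):
    """Ways of splitting an even number of speakers into two unlabelled halves."""
    half = n_speakers // 2
    # math.comb counts every split twice (groups A and B swapped), hence // 2.
    return math.comb(n_speakers, half) // 2


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-term log-likelihood for one item in two sub-corpora (assumed form)."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


print(balanced_groupings(32))  # 300540195

# Hypothetical frequencies: 50 vs 20 hits in two sub-corpora of 10,000 tokens.
print(log_likelihood(50, 10_000, 20, 10_000) > CRITICAL_05)  # True
```

A score above 3.84 corresponds to p<.05 with one degree of freedom, which is the threshold behind the expected 5 per cent error rate for random groupings.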
Table 11. Random variation in BNC 32: per cent of statistically significant results with LL-scores

Linguistic variable   random 500 I   random 500 II   random 500 III   Mean     SD
fuck                  99.0%          98.8%           99.0%            98.93%   0.1%
lovely                48.2%          48.0%           48.2%            48.13%   0.1%
you know              76.4%          77.0%           81.0%            78.13%   2.5%
sort of               80.0%          77.6%           77.8%            78.47%   1.3%
really                63.6%          56.8%           62.4%            60.93%   3.6%
negative concord      58.0%          58.2%           55.4%            57.20%   1.6%
the                   81.8%          79.2%           80.0%            80.33%   1.3%
some                  51.0%          46.0%           47.4%            48.13%   2.58%
A good example of the shortcoming of the analysis with aggregate data is the linguistic variable fuck. As we have seen, almost 99 per cent of random groupings appear statistically significant. This is due to the fact that in BNC 32 the taboo forms are used by only a small number of speakers (9, see Table 12). Moreover, one speaker (M15) uses the taboo forms with a frequency that is larger than the frequency of fuck in the speech of the remaining eight speakers combined. This results in a situation in which a single speaker sways the balance towards whichever group he is in. The null hypothesis is therefore almost always rejected.
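The mechanism described here can be reproduced with invented numbers. In the sketch below, all per-speaker counts and the uniform 30,000 tokens per speaker are hypothetical (they are not the BNC 32 figures), and the two-term log-likelihood form is an assumption. Nine of the 32 speakers use the form, and one of them uses it more often than the other eight combined.

```python
import math
import random


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-term log-likelihood for one item in two sub-corpora (assumed form)."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score


TOKENS_PER_SPEAKER = 30_000  # hypothetical, identical for every speaker
# Hypothetical counts: one heavy user (120), eight light users, 23 non-users.
counts = [120, 3, 4, 5, 6, 7, 8, 9, 10] + [0] * 23


def share_significant(counts, n_splits=500, seed=0, critical=3.84):
    """Share of random 16/16 speaker splits with a 'significant' LL score."""
    rng = random.Random(seed)
    speakers = list(range(len(counts)))
    half = len(speakers) // 2
    hits = 0
    for _ in range(n_splits):
        rng.shuffle(speakers)
        freq_a = sum(counts[i] for i in speakers[:half])
        freq_b = sum(counts[i] for i in speakers[half:])
        size_a = half * TOKENS_PER_SPEAKER
        size_b = (len(speakers) - half) * TOKENS_PER_SPEAKER
        hits += log_likelihood(freq_a, size_a, freq_b, size_b) > critical
    return hits / n_splits


print(share_significant(counts))  # 1.0
```

With these counts the sub-corpus containing the heavy user always has at least 120 hits against at most 52 in the other, so every random split exceeds the 3.84 threshold regardless of the seed.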
Table 12. Proportion of uses of the eight linguistic variables in BNC 32

Linguistic variable   No. of users   Per cent of users
fuck                  9              28%
lovely                31             97%
you know              32             100%
sort of               31             97%
really                32             100%
negative concord      20             63%
the                   32             100%
some                  32             100%
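The percentages in Table 12 are simply the number of users divided by the 32 speakers. The sketch below recomputes three of them and then uses two invented sets of per-speaker counts to illustrate that a proportion of users says nothing about how often each user produces the form. The function names and the invented counts are assumptions for illustration only.

```python
import math

N_SPEAKERS = 32


def spread(per_speaker_counts):
    """Proportion of speakers with at least one occurrence of the form."""
    users = sum(1 for count in per_speaker_counts if count > 0)
    return users / len(per_speaker_counts)


def as_per_cent(proportion):
    """Round half up to a whole percentage."""
    return math.floor(proportion * 100 + 0.5)


# Numbers of users as given in Table 12 (negative concord: 20 of 32 speakers).
for variable, users in (("fuck", 9), ("lovely", 31), ("negative concord", 20)):
    print(variable, as_per_cent(users / N_SPEAKERS))  # 28, 97 and 63

# Hypothetical per-speaker counts: same proportion of users, very different use.
even_use = [10, 10, 10, 10]
skewed_use = [1, 1, 1, 37]
print(spread(even_use), spread(skewed_use))  # 1.0 1.0
```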
A similar explanation can be offered for the almost 60 per cent of false hits in the case of negative concord. Negative concord occurs in the speech of 20 out of 32 speakers (63 per cent). However, only seven speakers use negative concord with a frequency greater than two per 10,000 words. If these seven speakers (or a majority of them) happen to be in one of the compared random groups, this group will differ significantly from the other random group.

The unequal distribution of linguistic variables in social sub-corpora has been pointed out by a number of researchers (Torgersen et al. 2011, Baker 2010, Gries 2006, Deutschmann 2003:230). Baker (2010:39) suggests that a measure of dispersion should be used in order to identify "whether a feature is typical of a population or localised to a few language users or specific cases". Similarly, Gries (2006:194) concludes that "inspecting and reporting by-subjects or by-file results should belong to the standard procedure of interpreting corpus-linguistic data." The idea of taking into account the fact that not all speakers in the corpus use the analysed linguistic variable has been picked up by Torgersen et al. (2011) in their study of pragmatic markers in spoken London English. They employ a measure which they refer to as 'spread', defined as the "proportion of speakers in a group that use a given PM [pragmatic marker]" (Torgersen et al. 2011:101). For our data, spread is reported in the third column of Table 12. As can be seen from Table 12, the spread is 28 per cent for the expletive fuck and 63 per cent for negative concord. This clearly indicates unequal distribution among individual speakers. For the rest of the variables, however, the spread is almost 100 per cent, which should indicate equal distribution among speakers, yet the proportions of false positive results for these forms are very high (48–80%).
This points to the fact that spread can be a useful indication of unequal distribution in only a limited number of clear-cut cases. Unfortunately, this measure does not take into consideration the different frequencies with which different speakers use a linguistic variable. In the case of the other six variables (lovely, you know, sort of, really, the and some) the high proportions of false positive results cannot be accounted for by pointing to the numbers of non-users (e.g. with the aid of the spread measure), as almost all speakers employ these variables in their speech (although with very different frequencies).

We argue that the reason for the high error rates needs to be sought in the very fact of aggregation of the data and the way this is combined with the procedure of null-hypothesis testing. If our analyses rely solely on aggregate data, two things happen: (i) we gain much larger amounts of data which can be fed into the equation of the statistical test and (ii) we lose information about inter-speaker variation. This leads to two separate but interrelated consequences: (I) null-hypothesis testing often generates a significant result merely by virtue of the quantity of the data and the natural distribution of words in corpora (cf. also Gries 2005, Johnson 2009, Kilgarriff 2005, Nickerson 2000). Kilgarriff (2005) describes the situation as follows:
Language is non-random and hence, when we look at linguistic phenomena in corpora, the null hypothesis will never be true. In corpus studies, we frequently do have enough data, so the fact that a relation between two phenomena is demonstrably non-random, does not support the inference that it is not arbitrary.
(II) A small number of speakers with strong preferences for a given linguistic variable can considerably increase the mean for one group. Since with aggregate data we do not look into the distribution of a linguistic variable across individual speakers, the results of broad comparisons of two sub-corpora are accepted as social reality. However, as we have seen, (I) and (II) lead to high proportions of statistically significant results with random data. This presents a serious problem for the aggregate data methodology.

In summary, we have seen that for a variety of reasons aggregate data with the log-likelihood statistic (which are used in many of the studies reviewed in Table 3) yield statistically significant results for a large proportion of sociolinguistically meaningless (i.e. random) speaker configurations. The differences reported in the studies based on this methodology are thus more likely to be a function of the method used than an indication of sociolinguistically meaningful variation.
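Consequence (I) can be illustrated with a small calculation. The sketch below holds a hypothetical difference in relative frequency constant (1.05 vs 1.00 occurrences per 1,000 words) and only increases the amount of data. The frequencies, the sub-corpus sizes and the two-term form of the log-likelihood statistic are assumptions made for illustration.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-term log-likelihood for one item in two sub-corpora (assumed form)."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score


# Same hypothetical rates in both comparisons; only the corpus size differs.
small = log_likelihood(105, 100_000, 100, 100_000)
large = log_likelihood(10_500, 10_000_000, 10_000, 10_000_000)

print(f"100,000 tokens per group:    LL = {small:.2f}")
print(f"10,000,000 tokens per group: LL = {large:.2f}")
```

Because the statistic grows in proportion to the counts, multiplying the data by 100 multiplies the score by 100: it rises from about 0.12 (not significant) to about 12.2, beyond the 10.83 threshold for p<.001, although the relative difference between the groups is unchanged.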
3.3 Replication of previous research with BNC 32: Individual speaker differences

Instead of having to rely on (unreliable) aggregate data and general comparisons with the log-likelihood statistic, in BNC 32 we can analyse the occurrences of linguistic variables in the speech of individual speakers (with the aid of the Mann-Whitney U test3). These analyses take into consideration both social group membership and individual speaker variation. Table 13 reports the results of a series of Mann-Whitney U tests, which take into account inter-speaker variation. We can see that the results in Table 13 are much more modest than the results based on general comparisons with aggregate data displayed in Table 10. In Table 13, only four out of 32 comparisons (12.5 per cent) are statistically significant, with two additional comparisons showing borderline significance values. This is in sharp contrast with the results reported in Table 10, 62 per cent of which reached the .05 significance level and 53 per cent the .001 significance level.

3. Although the Mann-Whitney U test is not commonly used in corpus comparison studies, it has been employed successfully by Kilgarriff (2001:115–116) to compare male and female speech and by Holmes (1995a) in her study of initial /t/ in New Zealand English.
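For readers unfamiliar with the test, the following is a minimal sketch of a Mann-Whitney U comparison applied to per-speaker relative frequencies. It uses the normal approximation with a tie correction and no continuity correction; statistical packages may instead report exact p-values for small samples, so the figures can differ slightly. The function names and the per-speaker rates are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank the values from 1 upwards; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided p) using the normal approximation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    u = min(u_a, n_a * n_b - u_a)
    n = n_a + n_b
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # all values identical: no evidence of a difference
        return u, 1.0
    z = (u - n_a * n_b / 2) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical rates per 10,000 words, one value per speaker.
group_a = [0.0, 0.0, 0.0, 0.5, 0.8, 1.0, 1.2, 40.0]
group_b = [0.0, 0.0, 0.3, 0.6, 0.9, 1.1, 1.3, 1.5]
u, p = mann_whitney_u(group_a, group_b)
print(f"U = {u}, p = {p:.3f}")
```

Because the test works on ranks, the speaker with a rate of 40.0 counts only as the highest-ranked observation; how far that value lies from the rest has no further influence on the result.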
Table 13. Social variation in BNC 32 subcorpora: individual differences (Mann-Whitney U)

Linguistic variable   Gender            Age               Socioeconomic status    Region
fuck                  p=.465            p=.005** (Y+)     p=.290                  p=.703
lovely                p=.007** (F+)     p=.458            p=.064                  p=.948
you know              p=.867            p=.235            p=.580                  p=.053(*) (M-N+)
sort of               p=.423            p=.984            p=.667                  p=.586
really                p=.402            p=.219            p=.334                  p=.048* (S+)
negative concord      p=.083            p=.391            p=.057(*) (WC+)         p=.481
the                   p=.021* (M+)      p=.235            p=.473                  p=.263
some                  p=.539            p=.535            p=.166                  p=.444

Preferred by: M+ (males), F+ (females), Y+ (younger speakers), O+ (older speakers), WC+ (working-class speakers), MC+ (middle-class speakers), S+ (South), M-N+ (Midlands & North)
* p<.05; ** p<.01; (*) nearly significant; Other: n.s.
The sociolinguistic picture reported in Table 13 is as follows: the taboo forms of fuck are preferred by younger speakers and the form lovely by female speakers. Really occurs more frequently in the speech of speakers from the South. The definite article the is preferred by male speakers. The borderline cases include the preference of speakers from the Midlands and the North for you know and of working-class speakers for negative concord (which would be in line with previous studies of negative concord in English, Chambers 2003). Neither sort of nor some shows any statistically significant social distribution.

At this stage a brief note needs to be made about the "dummy variable" the. The definite article turned out to be significant for the gender variable (preference by male speakers) in the Mann-Whitney U test. Note, however, that when the inter-speaker differences were taken into consideration, neither age nor socioeconomic status was significant (cf. Table 10, which presents the aggregate data results). The was chosen as a "dummy variable" since, as a very frequent grammatical word with a general distribution, it does not seem to index any social-group membership. This raises the following questions: is the gender-based variation in the use of the definite article reported in Table 13 genuine? If so, what does it signal in linguistic and social reality?
When we look at the distributions of the and other frequent linguistic variables in the speech of individual BNC 32 speakers, we can see that the definite article competes with personal and possessive pronouns (especially those in the singular). In fact, there is a negative correlation (rs = -.371, p<.05) between the and the personal and possessive pronouns I, me, my, you, your, he, him, his, she and her, which in informal conversation index the speaker, the addressee and, typically, people known to the speaker/addressee (friends, family, neighbours, etc.). Moreover, as can be seen in Figure 2, the use of personal pronouns (and the) clearly distinguishes male and female speech.
Figure 2. The definite article and selected personal and possessive pronouns in BNC 32 [legend: Gender (M, F); axis label: I, me, my, you, your, he, him, his, she, her]

This suggests that the linguistic behaviour of the male and female speakers may differ along a more complex scale similar to Biber's (1991) involved vs. informational dimension (cf. Rayson et al. 1997).
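The reported correlation can be checked for plausibility. The sketch below computes Spearman's rank correlation as a Pearson correlation on average ranks and converts a coefficient into a t statistic with n-2 degrees of freedom, a common approximation for testing rs. Whether this approximation was the one used for the figure reported above is an assumption; the function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank the values from 1 upwards; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for position in range(start, end + 1):
            ranks[order[position]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return covariance / math.sqrt(var_x * var_y)


def t_statistic(rho, n):
    """t approximation for a rank correlation, with n - 2 degrees of freedom."""
    return rho * math.sqrt((n - 2) / (1 - rho ** 2))


T_CRITICAL_30_DF = 2.042  # two-tailed critical value of t, 30 df, p < .05

t = t_statistic(-0.371, 32)
print(round(t, 2), abs(t) > T_CRITICAL_30_DF)  # -2.19 True
```

With 32 speakers, a coefficient of -.371 gives a t value just beyond the two-tailed .05 threshold, which is consistent with the reported p<.05.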
3.4 Inter-speaker differences and random speaker groups
In the final analysis, let us look at the methodology used in Section 3.3 (which evaluates social group membership as well as inter-speaker differences), this time employed with random data. With general sub-corpora comparisons using the log-likelihood statistic, a large proportion (48–99 per cent) of the random group comparisons turned out to be statistically significant at the .05 level. Table 14 shows the comparisons of the same random groups using the Mann-Whitney U test, which takes into account the distribution of the variables in the speech of the individual speakers.
Table 14. Random variation in BNC 32: per cent of statistically significant results with Mann-Whitney U

Linguistic variable   random 500 I   random 500 II   random 500 III   Mean    SD
fuck                  1.0%           1.4%            2.0%             1.47%   0.5%
lovely                6.2%           4.6%            3.6%             4.80%   1.3%
you know              6.2%           5.6%            5.6%             5.80%   0.3%
sort of               3.6%           5.8%            2.6%             4.00%   1.6%
really                3.8%           3.4%            5.4%             4.20%   1.1%
negative concord      4.2%           2.6%            4.4%             3.73%   1.0%
the                   5.2%           5.0%            5.2%             5.13%   0.1%
some                  4.6%           5.2%            5.0%             4.93%   0.3%
We can see that the proportion of false positive results is stable across the three samples of 500 groupings. Moreover, the error rates are relatively low (with means varying from 1.47 per cent to 5.80 per cent). In all but one case (you know) the proportion of statistically significant random group comparisons is below or around the acceptable 5 per cent level. This shows that for assessing the statistical significance of sociolinguistic data, employing the Mann-Whitney U test to compare the distributions of linguistic variables in the speech of individual speakers is a much more reliable procedure than general sub-corpora comparisons with the log-likelihood statistic (aggregate data).

This general finding draws corpus linguistics practice more closely in line with emerging best practice in quantitative sociolinguistics. For some time, ethnography-based studies of sociolinguistic variation have stressed the importance of understanding the inter-individual variation that lies below aggregated categories such as "women" and "men" (Eckert 1989, 2000), and a mixed-method approach that informs quantitative analysis with qualitative analysis has become quite normal, especially in the study of gendered identities (cf. Bucholtz 1999, Eckert 2000, Kiesling 1998 and Meyerhoff 1999). More recently, Johnson (2009) has made a cogent case for all quantitative studies in the Labovian tradition to include individual speaker as a random effect in multivariate analyses. While the design of some major corpora used in corpus linguistics may make it difficult or impossible to implement individual speaker as a random effect in our statistical analyses, Johnson's (2009) underlying motivation for recommending this practice is very similar to the motivation provided here in arguing for the merits of non-parametric tests on large corpora such as the BNC. That is, we are all motivated to explore the data as reliably as possible.
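The behaviour of the by-speaker procedure under random grouping can be sketched in miniature. In the code below the per-speaker counts and the uniform 30,000 tokens per speaker are invented (one heavy user, eight light users, 23 non-users), and the Mann-Whitney p-value uses the normal approximation with a tie correction; none of this reproduces the BNC 32 data or the exact procedure behind Table 14.

```python
import math
import random
from collections import Counter


def mann_whitney_p(group_a, group_b):
    """Two-sided p-value, normal approximation with tie correction."""
    tallies = Counter(group_a + group_b)
    # Average rank of each distinct value (tied values share a rank).
    rank_of, position = {}, 0
    for value, count in sorted(tallies.items()):
        rank_of[value] = position + (count + 1) / 2
        position += count
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    u_a = sum(rank_of[value] for value in group_a) - n_a * (n_a + 1) / 2
    tie_term = sum(c ** 3 - c for c in tallies.values())
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0
    z = (u_a - n_a * n_b / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


TOKENS_PER_SPEAKER = 30_000  # hypothetical, identical for every speaker
counts = [120, 3, 4, 5, 6, 7, 8, 9, 10] + [0] * 23  # hypothetical counts
rates = [c / TOKENS_PER_SPEAKER * 10_000 for c in counts]  # per 10,000 words


def share_significant(rates, n_splits=500, seed=0, alpha=0.05):
    """Share of random 16/16 speaker splits with p < alpha."""
    rng = random.Random(seed)
    shuffled = list(rates)
    half = len(shuffled) // 2
    hits = 0
    for _ in range(n_splits):
        rng.shuffle(shuffled)
        hits += mann_whitney_p(shuffled[:half], shuffled[half:]) < alpha
    return hits / n_splits


print(share_significant(rates))
```

Because the comparison is made on the ranks of individual speakers, the heavy user cannot sway the result alone, and the share of significant random splits should stay close to the nominal 5 per cent level.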
Table 15 summarises the findings reported in the literature for the six selected linguistic variables and compares them with the results of the analyses reported in this study.
Table 15. Overview of the results

Linguistic        Predicted effect            BNC 32 social:    BNC 32 random:    BNC 32 social:         BNC 32 random:
variable          (based on the literature)   aggregate         aggregate         individual             individual
                                              (Sect. 3.1)       (Sect. 3.2)       (Sect. 3.3)            (Sect. 3.4)

fuck              gender (M+)                 gender (M+)       98.93%            age (Y+)               1.47%
                  age (Y+)                    age (Y+)
                  class (WC+)                 class (WC+)
                                              region (S+)

lovely            gender (F+)                 gender (F+)       48.13%            gender (F+)            4.80%
                                              class (MC+)

you know          gender mixed (NO, F+)       gender (F+)       78.13%            region (M-N+, p=.053)  5.80%
                  age (adults+)               age (O+)
                  class (complex)             region (M-N+)

sort of           gender (M+)                 gender (M+)       78.47%            -                      4.00%
                  age (Y+)                    class (WC+)

really            gender (F+)                 age (Y+)          60.93%            region (S+)            4.20%
                  age mixed (NO, Y+)          class (MC+)
                  class (MC+)                 region (S+)

negative concord  region (S+)                 gender (F+)       57.20%            class (WC+, p=.057)    3.73%
                                              region (M-N+)

the               -                           gender (M+)       80.33%            gender (M+)            5.13%
                                              age (O+)
                                              class (MC+)

some              -                           gender (F+)       48.13%            -                      4.93%
We can clearly see that there is a good match between the predicted effects (based on the literature) and the results of the general comparison of BNC 32 sub-corpora (column 3). The only difference is the predicted effect of region on negative concord (S+), which does not correspond with the findings of this study: these suggest that speakers from the Midlands and the North use negative concord more than speakers from the South. However, the overall good match is not surprising since similar methodologies based on aggregate data were used. At the same time, the aggregate data methodology yielded very large percentages of false positive results, which casts doubt on both the results of the BNC 32 analyses with aggregate data and the results reported in the literature. Interestingly, there is a proportional relationship between the number of statistically significant results reported in column 3 (for BNC 32) and the percentages of false positive results in column 4. This points to the fact that the more prone a particular linguistic variable is to spurious results, the more sociolinguistic differences we can get with the aggregate data methodology. On the other hand, when we took inter-speaker differences into consideration, we got a considerably smaller number of effects for the four social variables (see Table 15, column 5). At the same time, random data comparisons yielded errors which were usually below or around the 5 per cent level.

As the results of this study show, it is very important to choose a reliable methodology for comparing corpus data. We have seen that comparisons based on aggregate data using the log-likelihood statistic yielded spurious results with very little bearing on social reality. In this respect, the procedure that took into consideration the variation between individual speakers and used the Mann-Whitney U test proved to be much more reliable.
Our findings point to an important general principle that needs to be considered when making sociolinguistic generalisations based on corpus data. A corpus (no matter how large) should never be regarded as a magic box which is able to provide fast and ready-made answers to any sociolinguistic questions we choose to ask. We therefore suggest the following steps as a guide:
i. Explore the functions of the analysed linguistic forms in the context in which they appear in the speech of individual speakers4;
ii. Consider the differences in the means (medians) between social groups as well as the individual differences between speakers and assess their linguistic meaningfulness;
iii. Inspect the data visually with the aid of scatterplots, bar-charts, etc.;
iv. Consider the corpus as a sample of the language of a particular speech community. Consider how much insight the corpus provides into the linguistic behaviour of the speech community and to what extent the findings based on the corpus can be generalised;
v. Based on (i)–(iv), assess the ecological validity of the results.

4. This stage of analysis has not been the focus of our critical review in this paper. We are aware (as an anonymous reviewer notes) that some, if not all, of the studies in Table 3 attempt to be accountable to context in this way.

In this article we have tried to show that using aggregate data for sociolinguistic inquiries (which is not an uncommon practice) can be potentially very problematic. Such analyses often reproduce stereotypes about language and society rather than contribute to our understanding of genuine sociolinguistic variation since, as we have shown, similar results can be produced with socially meaningless groupings.

References
Anderwald, L. 2002. Negation in Non-standard British English. London: Routledge.
Anderwald, L. 2005. "Negative concord in British English dialects". In Y. Iyeiri (Ed.), Aspects of English Negation. Amsterdam: John Benjamins, 113–138.
Baker, P. 2010. Sociolinguistics and Corpus Linguistics. Edinburgh: Edinburgh University Press.
Barbieri, F. 2008. "Patterns of age-based linguistic variation in American English." Journal of Sociolinguistics, 12 (1), 58–88. DOI: 10.1111/j.1467-9841.2008.00353.x
Barlow, M. 2002. MonoConc Pro 2.0. Houston: Athelstan Publications.
Biber, D. 1991. Variation Across Speech and Writing. Cambridge: Cambridge University Press.
Biber, D. 2009. "A corpus-driven approach to formulaic language in English: Multi-word patterns in speech and writing". International Journal of Corpus Linguistics, 14 (3), 275–311. DOI: 10.1075/ijcl.14.3.08bib
Biber, D., S. Conrad & R. Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511804489
Biber, D., S. Johansson, G. Leech, S. Conrad, E. Finegan & R. Quirk. 1999.
Longman Grammar of Spoken and Written English. Harlow: Longman.
Brezina, V. 2013. "Certainty and uncertainty in spoken language: In search of epistemic sociolect and idiolect." In M. Reif, J. Robinson & M. Pütz (Eds.) Variation in Language and Language Use, Frankfurt am Main: Peter Lang, 97–107.
Bucholtz, M. 1999. "'Why be normal?' Language and identity practices in a community of nerd girls". Language in Society, 28 (2): 203–223.
Chambers, J.K. 2003. Sociolinguistic Theory: Linguistic Variation and its Social Implications. Oxford: Blackwell.
Cheshire, J. 1992. Variation in an English Dialect: A Sociolinguistic Study. Cambridge: Cambridge University Press.
Deutschmann, M. 2003. Apologising in British English. Umeå: Umeå Universitet.
Eckert, P. 1989. "The whole woman: Sex and gender differences in variation". Language Variation and Change, 1 (3): 245–267. DOI: 10.1017/S095439450000017X
Eckert, P. 2000. Linguistic Variation as Social Practice. Oxford: Blackwell.
Field, A. 2009. Discovering Statistics Using SPSS. London: SAGE.
Gibbons, J.D. & Chakraborti, S. 2003. Nonparametric Statistical Inference. New York: Marcel Dekker.
Gries, S.T. 2005. "Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff." Corpus Linguistics and Linguistic Theory, 1 (2), 277–294. DOI: 10.1515/cllt.2005.1.2.277
Gries, S.T. 2006. "Some proposals towards more rigorous corpus linguistics." Zeitschrift für Anglistik und Amerikanistik, 54 (2), 191–202.
Holmes, J. 1995a. "Time for /t/: Initial /t/ in New Zealand English." Australian Journal of Linguistics, 15 (2), 127–156. DOI: 10.1080/07268609508599522
Holmes, J. 1995b. Women, Men and Politeness. London: Longman.
Hunston, S. & G. Francis. 2000. Pattern Grammar: A Corpus-driven Approach to the Lexical Grammar of English. Amsterdam: John Benjamins. DOI: 10.1075/scl.4
Ito, R. & Tagliamonte, S. 2003. "Well weird, right dodgy, very strange, really cool: Layering and recycling in English intensifiers". Language in Society, 32, 257–279. DOI: 10.1017/S0047404503322055
Johnson, D.E. 2009. "Getting off the GoldVarb standard: Introducing Rbrul for mixed-effects variable rule analysis". Language and Linguistics Compass, 3 (1), 359–383. DOI: 10.1111/j.1749-818X.2008.00108.x
Kiesling, S.F. 1998. "Men's identities and sociolinguistic variation: The case of fraternity men". Journal of Sociolinguistics, 2 (1), 69–99. DOI: 10.1111/1467-9481.00031
Kilgarriff, A. 2001. "Comparing corpora". International Journal of Corpus Linguistics, 6 (1), 97–133. DOI: 10.1075/ijcl.6.1.05kil
Kilgarriff, A. 2005. "Language is never, ever, ever, random". Corpus Linguistics and Linguistic Theory, 1 (2), 263–276.
Macaulay, R. 2002. "You know, it depends". Journal of Pragmatics, 34 (6), 749–767. DOI: 10.1016/S0378-2166(01)00005-4
Mann, H.B. & Whitney, D.R. 1947. "On a test of whether one of two random variables is stochastically larger than the other." The Annals of Mathematical Statistics, 18 (1), 50–60. DOI: 10.1214/aoms/1177730491
McEnery, A. & Hardie, A. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press.
McEnery, A. & Xiao, Z. 2004. "Swearing in modern British English: The case of fuck in the BNC." Language and Literature, 13 (3), 235–268. DOI: 10.1177/0963947004044873
Meyerhoff, M. 1999. "Sorry in the Pacific: Defining communities, defining practices". Language in Society, 28 (2), 225–238. DOI: 10.1017/S0047404599002055
Murphy, B. 2009. "'She's a fucking ticket': the pragmatics of fuck in Irish English: An age and gender perspective." Corpora, 4 (1), 85–106. DOI: 10.3366/E1749503209000239
Murphy, B. 2010. Corpus and Sociolinguistics: Investigating Age and Gender in Female Talk. Amsterdam: John Benjamins. DOI: 10.1075/scl.38
Nevalainen, T. 1999. "Making the best use of 'bad' data: Evidence for sociolinguistic variation in Early Modern English". Neuphilologische Mitteilungen, 100 (4), 499–533.
Nevalainen, T. & Raumolin-Brunberg, H. 2003. Historical Sociolinguistics. London: Longman.
Nickerson, R.S. 2000. "Null hypothesis significance testing: A review of an old and continuing controversy". Psychological Methods, 5 (2), 241. DOI: 10.1037/1082-989X.5.2.241
Rayson, P., Berridge, D. & Francis, B. 2004. "Extending the Cochran rule for the comparison of word frequencies between corpora." In G. Purnelle, C. Fairon & A. Dister (Eds.) Le Poids des Mots: Proceedings of the 7th International Conference on Statistical analysis of textual data (JADT 2004), Louvain-la-Neuve: Presses universitaires de Louvain, 926–936.
Rayson, P., Leech, G. & Hodges, M. 1997. "Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus". International Journal of Corpus Linguistics, 2 (1), 133–152. DOI: 10.1075/ijcl.2.1.07ray
Schmid, H.-J. 2003. "Do women and men really live in different cultures? Evidence from the BNC." In A. Wilson, P. Rayson & T. McEnery (Eds.) Corpus Linguistics by the Lune. A Festschrift for Geoffrey Leech. Frankfurt: Peter Lang, 185–221.
Scott, M. 1997. "PC analysis of key words and key key words". System, 25 (2), 233–245. DOI: 10.1016/S0346-251X(97)00011-0
Scott, M. 2001. "Comparing corpora and identifying key words, collocations, and frequency distributions through the WordSmith Tools suite of computer programs". In M. Ghadessy, A. Henry & R.L. Roseberry (Eds.) Small Corpus Studies and ELT: Theory and Practice. Amsterdam: John Benjamins, 47–67.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Smith, J. 2001. "Negative concord in the Old and New World: Evidence from Scotland". Language Variation and Change, 13 (2), 109–134. DOI: 10.1017/S0954394501132011
Stubbs, M. 2001. "Texts, corpora, and problems of interpretation: A response to Widdowson." Applied Linguistics, 22 (2), 149–172. DOI: 10.1093/applin/22.2.149
Torgersen, E.N., Gabrielatos, C., Hoffmann, S. & Fox, S. 2011. "A corpus-based study of pragmatic markers in London English". Corpus Linguistics and Linguistic Theory, 7 (1), 93–118. DOI: 10.1515/cllt.2011.005
Widdowson, H.G. 2000. "On the limitations of linguistics applied".
Applied Linguistics, 21 (1), 3–25. DOI: 10.1093/applin/21.1.3
Xiao, R., & Tao, H. 2007. "A corpus-based sociolinguistic study of amplifiers in British English." Sociolinguistic Studies, 1 (2), 241–273.
The structure of BNC 32
Socioeconomic status AB C2 C2 C2 C2 C2 AB C1 C1 C1 C2 UU AB AB C1 DE C1 C1 C2 AB DE C1 C1 AB C1 AB C2 DE DE UU C2 DE
Region MW NE LO EA SS MC NC WA MS LO NC HC ME LO SS SL MW MD MC EA EA SS LO HC X X EA SL IR UR HC LC
No. of tokens 42,452 51,401 64,422 34,790 44,014 35,271 46,677 38,534 40,937 25,775 36,373 36,083 29,368 16,453 23,773 14,587 33,808 35,151 34,242 38,648 23,038 22,915 24,076 25,772 18,179 22,710 24,094 13,959 22,672 12,892 22,044 15,147
M=male, F=female; A=15–34, B=35–54, C=55+; AB=managerial, administrative, professional; C1=junior management, supervisory, professional; C2=skilled manual; DE=semi- or unskilled; UU=unknown; EA=East Anglia; HC=home counties; IR=Ireland; LC=Lancashire; LO=London; MC=Central Midlands; MD=Merseyside; ME=North-East Midlands; MS=South Midlands; MW=North-West Midlands; NC=Central Northern England; NE=North-East England; SL=Lower South-West England; SS=Central South-West England; UR=European; WA=Wales.
Authors' addresses

Vaclav Brezina
ESRC Centre for Corpus Approaches to Social Science
Lancaster University
FASS Building, County South
LA1 4YD, Lancaster
UK
[email protected]

Miriam Meyerhoff
Victoria University of Wellington
School of Linguistics and Applied Language Studies
PO Box 600, Wellington 6140
New Zealand
[email protected]