## Do you have a sample that is not representative of the population? Here’s a way to handle that!

I ran into a simple (and probably all too common) problem the other day: my wife had survey data for an MBA project, from which she needed to draw conclusions about potential buyers of a product.

The problem was that the sample was not representative of the overall French population (the population she ultimately wanted to study for this specific project). Most of the respondents were young men between 20 and 30 years old, while 52% of the French population are women, and few people over 50 responded.

How might she use this non-representative data to draw generalizable conclusions? The only solution I could think of was to resample the data, stratify it to ensure that the representation of each group would match that of the original population. But, by doing this, we would end up losing too much data.

I ended up contacting a professor I had during my master’s in statistics. She specialized in survey sampling, so I assumed she would know an alternative answer to our problem. Fortunately, I was right!

Unrepresentative samples are a problem in 2 different senses: some segments of the population are over-represented and others are under-represented. In order to solve the overall problem, we then have two options: increase representation of the underrepresented, or decrease representation of the overrepresented.

## Declining representation

My first idea was to randomly exclude from my analysis people belonging to overrepresented groups, in this case young men. It can be a bit difficult to do this by hand, since we have more than one variable involved. Instead, we can create a stratified sample, using age and gender as stratification variables, to select the responses we want. dungeonthen exclude others.

This is certainly a possibility, but there is a problem: we lose data, which means that the final sample size will be smaller than we expected, thus increasing the variance of the estimator.

## Increase representation

My former professor then gave me a second, better suggestion: a technique called “post-stratification weighting”, which involves calculating our estimates on all the data collected, but then weighting the answers according to the demographic data previously collected (such as census data) .

In our example, this means weighting the responses using the age and gender proportions within the French population (this type of data can easily be obtained online). I then collected the percentages of men and women by age group in France, and weighted the responses accordingly.

I generated fake data to simulate the situation mentioned above. Imagine we have a sample of 10,000 observations, where younger men are overrepresented:

`import pandas as pddf = pd.DataFrame({'gender': ['M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'M'],'age_range': ['20-25', '20-25', '25-30', '30-35', '20-25', '20-25', '30-35', '25-30', '20-25', '30-35'],'answer1': [5, 7, 7, 8, 5, 7, 6, 9, 8, 9],'answer2': ['A', 'A', 'B', 'A', 'C', 'B', 'C', 'A', 'C', 'C']})df = pd.concat(1000*[df]) # we do this in order to have enough data to sample from`

We could start by calculating the frequencies of each gender x age range in our sample, calling them “prev_weight”, and then adding them to our dataset:

`prev_weights = df.groupby(['gender','age_range']).size().reset_index().rename(columns={0:'prev_weight'})df = df.merge(prev_weights, on=['gender', 'age_range'])`

We would then add a column called “weight”, which would give us the frequency of these groups within the overall population:

`props = pd.DataFrame({'gender': ['M', 'M', 'M', 'F', 'F', 'F'],'age_range': ['20-25', '25-30', '30-35', '20-25','25-30', '30-35'],'weight': [0.15, 0.16, 0.17, 0.17, 0.16, 0.19]})df = df.merge(props, on=['gender', 'age_range'])`

Now, the last step is to adjust the weights according to the frequencies in our original samples: the more a group was overrepresented, the more we have to compensate for it:

`df['weight'] = df['weight']/df['prev_weight']`

Now, whenever we want to estimate the overall population, we can use these weights. Here is an example of the results we get with and without weighting:

`>>> df[‘answer1’].mean()7.1>>> (df[‘answer1’]*df[‘weight’]).sum()7.405000000000001`

Full code is available here.

The main problem with these approaches is that they address the most obvious discrepancies between the sample and the population, but others might still exist. For example, in this case, we knew the age and gender of the respondents, but what about their level of education? Revenue? In which region of France do they live? Since the sample was not completely random, nor stratified by all of these other variables, the methods mentioned above will not fix this. Yet our results will be closer to reality than if we had just used our data as is.

Another issue is that answer weighting works for simple cases, where we just want to make estimates such as averages or top picks. If you want to take it a step further and run models or tests on this data, you’ll probably need a different set of techniques. 