CPS Microdata

Current Population Survey Microdata with Python

January 2026

This tutorial shows two ways to read CPS microdata with Python:

  • CSV method - simplest approach, recommended for quick access to single months of recent data
  • Struct method - fastest approach, recommended for processing many months of data

Note: IPUMS provides a user-friendly interface for downloading CPS data (and other surveys). IPUMS handles the complexity of variable selection and file formatting, making it an excellent alternative to working with raw Census files directly.

See also: Tom Augspurger's pycps and his four-part blog series (1, 2, 3, 4) for more on working with CPS microdata in Python.

The Census Basic Monthly CPS page contains the microdata files (in both CSV and fixed-width format), along with data dictionaries identifying each variable name, location, value range, and whether it applies to a restricted sample.

Method 1: CSV files (recommended for single months)

Census publishes CSV files for recent months of CPS data. This is the simplest way to load CPS microdata and is useful for quick access to a limited amount of recent data. Download the CSV file from the Census CPS page.

Note that CSV files are larger than the fixed-width format files, and are not available for all years.
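
If you prefer not to download the file by hand, pandas can also read a CSV directly from a URL. The path below follows the file-naming pattern on the Census server at the time of writing, but it is an assumption; confirm the exact location on the Census CPS page.

import pandas as pd

# Assumed URL pattern for the December 2025 basic monthly CSV file;
# verify the path on the Census CPS page before relying on it
url = ('https://www2.census.gov/programs-surveys/cps/datasets/'
       '2025/basic/dec25pub.csv')
df = pd.read_csv(url, usecols=['prtage', 'pesex', 'prempnot', 'pwcmpwgt'])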

Read and filter the data

This example calculates the employment-to-population ratio for women age 25 to 54 in December 2025. First, we read only the columns we need from the CSV file. This speeds up the code and uses less memory than reading the entire file. We then filter to our population of interest: women (pesex == 2) between ages 25 and 54.

In[1]:

import pandas as pd
import numpy as np

# Read selected columns and query for women age 25 to 54
columns = ['prtage', 'pesex', 'prempnot', 'pwcmpwgt']
df = (pd.read_csv('dec25pub.csv', usecols=columns).dropna()
        .query('pesex == 2 and 25 <= prtage <= 54'))

Calculate the weighted employment rate

The CPS is a sample survey, so each observation represents many people in the population. The pwcmpwgt variable is the person-level composite weight, which tells us how many people each survey respondent represents. We create an indicator variable for employment (prempnot == 1 means employed) and then calculate the weighted average to get the employment rate.

In[2]:

# Identify employed portion of group as 1 & the rest as 0
empl = np.where(df['prempnot'] == 1, 1, 0)

# Take sample weighted average of employed portion of group
epop = np.average(empl, weights=df['pwcmpwgt'])

# Print out the result to check against LNU02300062
print(f'December 2025: {epop*100:.1f}%')
December 2025: 75.4%

This result matches the BLS published value for December 2025.
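
To automate that comparison, one option is the BLS public data API. The sketch below uses version 1 of the API, which requires no registration key but is rate-limited and returns only the last few years; treat it as a quick spot-check rather than a data pipeline.

import requests

# Fetch recent published values of LNU02300062 from the BLS public API
url = 'https://api.bls.gov/publicAPI/v1/timeseries/data/LNU02300062'
obs = requests.get(url).json()['Results']['series'][0]['data']
print(obs[0]['year'], obs[0]['periodName'], obs[0]['value'])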

Method 2: Struct method (fastest for many months)

If you are processing decades of monthly data, the struct method is the fastest approach. This method reads the fixed-width format files directly, where each variable occupies a specific position in each row of data.
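
Before working through the full example, here is a minimal toy sketch (a made-up record, not CPS data) of how Python's struct module pulls fixed-position fields out of a row:

import struct

# Toy fixed-width record: 3-character ID, 2-character age, 1-character sex
row = b'00142F'

# '3x' skips the ID, '2s' keeps the age, '1s' keeps the sex
age, sex = struct.Struct('3x2s1s').unpack_from(row)
print(age, sex)
b'42' b'F'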

Download the data dictionary (e.g., January_2017_Record_Layout.txt) and the microdata file (e.g., apr17pub.dat) from the Census CPS page. This example calculates the same employment-to-population ratio for women age 25 to 54, but for April 2017.

In[3]:

# Import relevant libraries
import re, struct
import pandas as pd
import numpy as np

Parse the data dictionary

The data dictionary file describes how to read the fixed-width format CPS microdata files. It tells us where each variable is located in the raw data. We manually identify four variables of interest: PRTAGE for age, PESEX for gender, PREMPNOT for employment status, and PWCMPWGT for the person-level composite weight.

In[4]:

# Read data dictionary text file
data_dict = open('January_2017_Record_Layout.txt').read()

# Manually list out the IDs for series of interest
var_names = ['PRTAGE', 'PESEX', 'PREMPNOT', 'PWCMPWGT']

The data dictionary text file follows a pattern that makes it machine readable. We use a regular expression to extract each variable's name, length, and start/end positions. The start location is adjusted by -1 for Python's zero-based indexing. The width is stored as a string ending in s, the struct format code for a byte string, so, for example, 2s denotes a two-character field.

Note that data dictionaries change over time and don't follow a consistent format, so the regex pattern may need adjustment for different years.

In[5]:

# Regular expression matching series name and data dict pattern
# (a raw string avoids invalid-escape-sequence warnings)
p = rf'\n({"|".join(var_names)})\s+(\d+)\s+.*?\t+.*?(\d\d*).*?(\d\d+)'

# Dictionary of variable name: [start, end, and length + 's']
d = {s[0]: [int(s[2])-1, int(s[3]), f'{s[1]}s']
     for s in re.findall(p, data_dict)}

print(d)
{'PRTAGE': [121, 123, '2s'], 'PESEX': [128, 130, '2s'], 'PREMPNOT': [392, 394, '2s'], 'PWCMPWGT': [845, 855, '10s']}
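
To see what the pattern is matching, here is a made-up entry in the style of the record layout file (the real file's spacing varies by year, so treat the line below as illustrative only):

# Hypothetical dictionary entry: name, size, description, (start - end)
sample = '\nPRTAGE\t2\tPERSONS AGE\t(122 - 123)'
print(re.findall(p, sample))
[('PRTAGE', '2', '122', '123')]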

Build the struct format string

Python's struct module can efficiently parse binary data using a format string. The format string specifies which characters to keep and which to skip. For example, 121x means skip 121 characters, while 2s means keep the next 2 characters as a string. By chaining these together, we can extract just the variables we need from each row.

In[6]:

# Lists of variable starts, ends, and lengths
start, end, width = zip(*d.values())

# Create list of which characters to skip in each row
skip = [f'{s - e}x' for s, e in zip(start, [0] + list(end[:-1]))]

# Create format string by joining skip and variable segments
unpack_fmt = ''.join([j for i in zip(skip, width) for j in i])
print(unpack_fmt)

# Struct can interpret row bytes with the format string
unpacker = struct.Struct(unpack_fmt).unpack_from
121x2s5x2s262x2s451x10s

Reading this format string: skip 121 characters, keep 2 (age), skip 5, keep 2 (sex), skip 262, keep 2 (employment status), skip 451, keep 10 (weight).
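
As a sanity check, struct.calcsize reports how many characters the format string consumes in total; it should equal the end position of the last variable, which is 855 for PWCMPWGT.

# Total width covered by the format string
print(struct.calcsize(unpack_fmt))
855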

Understanding fixed-width format

To see what the raw data looks like, here is the first line of the microdata file:

In[7]:

print(open('apr17pub.dat').readline())
000110116792163 42017 120100-1 1 1-1 115-1-1-1  15049796 1 2 1 7 2 0 205011 2  1 1-1-1-1 36 01 338600001103000   -1-1 1-1420 1 2 1 2-1 243 1-1 9-1 1-1 1 1 1 2 1 2 57 57 57 1 0 0 1 1 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 2-150-1-1 50-1-1-1-1 2-1 2-150-1 50-1-1    2 5 5-1 2 3 5 2-1-1-1-1-1 -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 -1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 1-121 1 1 1 6-1-1-1 -1-1-1 1 2-1-1-1-1 1 2 1 6 4      -1-1       4 3 3 1 2 4-1-1 6-138-114-1 1 9-1 3-1 2 1 1 1 0-1-1-1-1  -1  -1  -1  -10-1      -10-1-1      -1      -10-1-1-1-1-1-1-1-1-1 2-1-1 2  15049796  22986106         0  16044411  15280235 0 0 1-1-1-1 0 0 1 0-1 050 0 0 0 0 1 0 0 0-1-1-1 1 0 0-1 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 0 0 0-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1 0 1 1 3865 1-1-1-1-1-1-1 1 1 1-1-1-1  1573071277704210  -1  -114-1-1-1-1-1 0-1-1-1-1-15050 1 1 1 2 2 2 2 2 2 2 0 0 0 0 0 0 0-1-1-1-1-1 1 1 1202020                                            A

If we skip the first 121 characters and keep the next two, we find 42, which is the age of the person in the first row of the microdata.
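
The same spot-check works with plain string slicing, using the zero-based start and end positions from the parsed data dictionary:

# Characters at zero-based positions 121:123 hold the first person's age
line = open('apr17pub.dat').readline()
print(line[121:123])
42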

Read the raw microdata

We open the raw CPS microdata file in binary mode and read all lines. For each row, we check if the sample weight is positive (meaning the observation should be included), then apply the unpacker to extract just the four variables we need. The result is a list of lists, where each inner list contains the values for one person.

In[8]:

# Open file (read as binary) and read lines into "raw_data"
raw_data = open('apr17pub.dat', 'rb').readlines()

wgt = d['PWCMPWGT']  # Location of sample weight variable

# Unpack and store data of interest if sample weight > 0
data = [[*map(int, unpacker(row))] for row in raw_data
        if int(row[wgt[0]:wgt[1]]) > 0]

print(data[:5])
[[42, 1, 1, 15730712], [26, 2, 1, 14582612], [25, 2, 1, 20672047], [42, 2, 4, 15492377], [47, 1, 1, 18155638]]

Create pandas dataframe

We convert the list of lists to a pandas DataFrame for easier filtering and analysis. The DataFrame is filtered to women (PESEX == 2) between ages 25 and 54. The sample weights have four implied decimal places in the raw data, so we divide by 10,000 to get the actual weight values.

In[9]:

# Pandas dataframe of women age 25 to 54
df = (pd.DataFrame(data, columns=d.keys())
      .query('PESEX == 2 and 25 <= PRTAGE <= 54')
      .assign(PWCMPWGT = lambda x: x['PWCMPWGT'] / 10000))

print(df.head().to_string(index=False))
PRTAGE  PESEX  PREMPNOT   PWCMPWGT
    26      2         1  1458.2612
    25      2         1  2067.2047
    42      2         4  1549.2377
    49      2         1  1633.0038
    26      2         1  1611.2316

Calculate the weighted employment rate

As with the CSV method, we create an indicator variable for employment (PREMPNOT == 1 means employed) and calculate the weighted average using the composite weight. The result matches the BLS published value for April 2017.

In[10]:

# Identify employed portion of group as 1 & the rest as 0
empl = np.where(df['PREMPNOT'] == 1, 1, 0)

# Take sample weighted average of employed portion of group
epop = np.average(empl, weights=df['PWCMPWGT'])

# Print out the result to check against LNU02300062
print(f'April 2017: {epop*100:.1f}%')
April 2017: 72.3%

Scaling up

These examples can be scaled up to work with multiple years of monthly data. For a project creating harmonized partial CPS extracts, see here.
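
For example, a batch version of the struct method might look like the sketch below. It assumes the .dat files are already downloaded and that the January 2017 record layout applies to every file in the (hypothetical) list; a new data dictionary is needed whenever the layout changes. It reuses unpacker, wgt, and d from the cells above.

# Hypothetical file list; all files must share the January 2017 layout
files = ['jan17pub.dat', 'feb17pub.dat', 'mar17pub.dat', 'apr17pub.dat']

epops = {}
for f in files:
    rows = [[*map(int, unpacker(row))] for row in open(f, 'rb').readlines()
            if int(row[wgt[0]:wgt[1]]) > 0]
    group = (pd.DataFrame(rows, columns=d.keys())
             .query('PESEX == 2 and 25 <= PRTAGE <= 54'))
    epops[f] = np.average(group['PREMPNOT'] == 1, weights=group['PWCMPWGT'])

print(epops)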

About the CPS

The CPS was first fielded in 1940 to provide a more accurate estimate of the unemployment rate, and it is still the source of the official unemployment rate. The CPS is a monthly survey of around 65,000 households. Each selected household is surveyed up to eight times: interviewers collect basic demographic and employment information in the first three monthly interviews and ask additional detailed wage questions in the fourth. The household then leaves the sample for eight months, after which it completes another four monthly interviews, again with detailed wage questions in the fourth.

The CPS is not a simple random sample but a multi-stage stratified sample. In the first stage, each state and DC is divided into "primary sampling units" (PSUs). In the second stage, a sample of housing units is drawn from the selected PSUs.

There are also months in which households receive supplemental questions on a topic of interest. The largest such "CPS supplement", conducted each March, is the Annual Social and Economic Supplement (ASEC). The sample size for this supplement is expanded, and respondents are asked about their various sources of income and about the quality of their jobs (for example, health insurance benefits). Other supplements cover topics such as job tenure or computer and internet use.

The CPS is a joint product of the U.S. Census Bureau and the Bureau of Labor Statistics.

Special thanks to John Schmitt for guidance on the CPS.