30 data cleaning tips in 60 minutes

This is a workshop for IRE 2024 about a few general data cleaning tips! It’s taught by me, Jonathan Soma, Knight Chair in Data Journalism at Columbia Journalism School.

Contact: Find me at @dangerscarf or js4571@columbia.edu.

Other things: You might also be interested in my series Practical AI for Investigative Journalism or my website Everything I Know

Content

Slides: Slides are here

This stuff really only makes sense once you’ve tried it out, so we also have worksheets!

Google Sheets

Blank worksheet
Completed worksheet
Completed worksheet from in class (which definitely has less stuff in it than the other completed one!)

Python

You can also see the original GitHub repository

Text manipulation

Substitute with nothing

Let’s say you have 66 years old and just want the 66. The easiest approach is to just remove the years old portion.

The worst thing about Sheets is there is a REPLACE and there is a SUBSTITUTE. You want SUBSTITUTE, as REPLACE absolutely does NOT do what you want.

=SUBSTITUTE(A2, " years old", "")

In pandas, .replace will replace entire cells of data, while .str.replace is better for replacing portions of the content.

df['age'].str.replace(" years old", "")

Convert to a number

If you converted 66 years old to juse 66, you usually also have to make sure your software knows it’s a number. This is formally known as “type casting.”

Use VALUE(...) to turn text into a number.

=VALUE(SUBSTITUTE(A2, " years old", ""))

Use .astype(int) or .astype(float) to turn text into a number. If there are missing values you must use .astype(float), even if there are no decimals in the numbers!

df['age'].str.replace(" years old", "").astype(int)

Split

If you have text that looks like j soma, js4571@columbia.edu, @dangerscarf and want to get each section individually, you can split on the ,. The result will now take up multiple columns.

By default, splitting on ", " will split on either comma or space. To tell Sheets that you want to split on a comma AND a space, you add false.

=SPLIT(A2, ", ", false)

Most of the time you’ll just be splitting on one symbol - a space, a |, something like that - so you won’t need to use that part.

=SPLIT(A2, ", ")

Using .str.split with expand=True will give you a secondary dataframe.

df['details'].str.split(", ", expand=True)

Usually you’d use .join to add it back to the original df.

details = df['details'].str.split(", ", expand=True)
details.columns = ['name', 'email', 'handle']
df = df.join(details)
df.head()

Split and get first (…or second or third)

If you have text that looks like j soma, js4571@columbia.edu, @dangerscarf and want to just get the email address, you can split and ask for the second portion.

Take your SPLIT code and wrap it in INDEX. The final number is 1 for the first one, 2 for the second one, etc. Below we split on ", " and take the second part.

=INDEX(SPLIT(A2, ", ", false), 2)

Take your split code, and add .str[...] onto the end. The ... is the index of the part you’re looking for. [0] would be the first segment, [1] would be the second, etc.

# Split on ", " and take the second part
df['details'].str.split(", ").str[1]

Split and get last

Getting the last one in Sheets is a real pain because you need to use COUNTA and duplicate your split formula in order to count the number of items you created.

=INDEX(SPLIT(A2, ", ", false), COUNTA(SPLIT(A2, ", ", false)))

If you want to feel jealous look at the Python version.

The best thing about Python is using negative numbers to step backwards. .str[-1] gives you the last element.

df['details'].str.split(", ").str[-1]

Double split!!!

Honestly, just do this across multiple columns.

If you have j soma, js4571@columbia.edu, @dangerscarf, split once on ", " to split up the pieces and take the second one. Then split again on "@" and take the second portion.

Get last (or first) n characters

This will get the final twelve characters in a cell.

=RIGHT(F2, 12)

If you want to get the first twelve, use =LEFT instead.

Let’s get the final twelve characters of the details column.

df['details'].str[-12:]

If you want to get the first twelve, you use [:5] instead

df['details'].str[:12]

Regex

Just ask ChatGPT, honestly. Your job is to proof the results, not understand the process.

Match everything

When you want to extract something with regular expressions, you wrap it in parentheses.

I am 66 years old I am (66) years old I am (.*) years old

The .* means anything could be inside of the matched part - numbers, text, symbols, etc.

=REGEXEXTRACT(A2, "I am (.*) years old")

df['details'].str.extract(r"I am (.*) years old")

Don’t forget you might need to convert the result to a number! (see above)

Numbers ONLY

You can also use regex to only match numbers. For example, if you wanted to find a phone number formatted like 555-555-5555, you would replace every number with a \d.

In this case, it would end up as (\d\d\d-\d\d\d-\d\d\d\d).

=REGEXEXTRACT(A2, "(\d\d\d-\d\d\d-\d\d\d\d)")

df['details'].str.extract(r"(\d\d\d-\d\d\d-\d\d\d\d)")

Optional elements

Maybe the phone number could be formatted as 555-555-5555 or

Put a question mark after anything that’s optional (\d\d\d-?\d\d\d-?\d\d\d\d).

Anything

Maybe the phone number could be formatted as 555-555-5555 or 5555555555 or 555+555+

Instead of demanding we match an optional -, we can use . to say “this could be ANY character” in here.

Our final regex would look like (\d\d\d.?\d\d\d.?\d\d\d\d).

Escaping

One problem in regex is if you want to match a special character. If we wanted to use . to talk about an actual period instead of the “this could be anything,” we’d write \. instead.

Replace […] in Google Sheets without a formula

Since [ and ] are special characters, you need to escape them! To replace [u] or [x] or [example], your pattern is going to be \[.*?\[.

Our original content: [u] Replace “anything” with .: [.] Add \ to special characters: \[.\]

Matches pattern, not replaces

While you usually use regex for extracting information, sometimes you just want a yes/no about whether it exists.

=REGEXMATCH(A2, "\d\d\d-\d\d\d-\d\d\d\d")

df['details'].str.contains(r"\d\d\d-\d\d\d-\d\d\d\d")

Other things

ARRAYFORMULA

Instead of dragging down a formula, use ARRAYFORMULA!

Sheets

If you have a formula and you want it to apply to every row, you could just type it out and then drrraaagggg it down the page.

=VALUE(SUBSTITUTE(A2, " years old", ""))

Instead, write it in one cell, wrap it in ARRAYFORMULA, and adjust your cell reference from (for example) just A2 to A2:A. This applies it to every single cell in the column, starting from the second row.

=ARRAYFORMULA(VALUE(SUBSTITUTE(A2:A, " years old", "")))

Missing data

Sheets

=IF(ISBLANK(A2), "", ...your formula...)

The only problem with ARRAYFORMULA is that it goes alllll the way down the page, even for rows that don’t have data! If you nudge an IF and ISBLANK in there, though, you’ll be good to go.

=ARRAYFORMULA(IF(ISBLANK(A2,A), "", VALUE(SUBSTITUTE(A2:A, " years old", ""))))

Geocoding

Geocoding is the act of converting from addresses to latitude and longitude.

Drag and drop: Geocodio
Sheets: Geocode by Awesome Table
Python: Geocoder

If your addresses are all in one column, Geocodio requires you to split it into street, city, state, etc. You can probably do that with splitting or regex, as covered above!

Reverse geocoding

Reverse geocoding is the act of converting from latitude and longitude to approximate addresses.

Drag and drop: Geocodio
Sheets: Geocode by Awesome Table
Python: Geocoder

:::

Text analysis: Edit distance and fuzzy matching

Sometimes you need to combine “Jonathan Soma,” “Jonathan P Soma,” and “Soma, Jonathan” all into one listing. Fuzzy matching is what you’re looking for!

Non-coding: OpenRefine
Python: csvmatch

Text analysis: Entity extraction

I could talk forever about entity extraction! Check out this workshop I gave at Abraji for more details.

Table formatting

Transposing

Transposing is when you switch rows and columns. If you ever have a chart in Datawrapper that looks awful, that might be your solution! They have a “Transpose” button in the step for checking your data.

Convert from long to wide

Go watch this

Read the pandas documentation!

Convert from wide to long

Go read this

Read the pandas documentation!

AI stuff

Audio transcription

Everything is based on Whisper, but the only thing you need to know about it is that larger models are slower but perform better.

MacOS: MacWhisper
Windows: EasyWhisper
Python: Insanely Fast Whisper

OCR with open models

If you’re still using Tesseract, you’re doing yourself a disservice! Apparently this new one, Florence 2, is really good.

Go look at my PDFs section on Everything I Know.

OCR corrections

If you have a bad OCR, try pushing it through a large language model like ChatGPT. It’s great at making assumptions and fixing typos! Even if it hallucinates, it’s probably better than what you had initially (and you’re sure to double-check, right??).

Claude in Google Sheets

Maybe you have an unformatted document and want to pull out the company’s name from it, but even regex has failed you.

Claude for Sheets is like magic! Great for extracting content. You’ll need an API key to set it up, which is like a username/password for using the LLM outside of the chatbot.

=CLAUDEEXTRACT("company name", A2)

Watch my video Structured, validated data from LLMs for more details

Guardrails/Instructor

You can also extract structured data from LLMs using tools like Instructor or Guardrails. I used to be a Guardrails guy, but it’s honestly way too overengineered. I’m on team Instructor these days.

Let’s say we had some content that looks like this.

comment = """
FROM: Mulberry Peppertown, mulberry (at) example.com

When I pick up the cans of beans they are all so light! At
first I thought they were empty, but it turns out they are just
futuristic beans that are not heavy like the old style beans 
I was used to. It is incredible.
"""

We can use Instructor to extract the content we’re looking for in a structured way.

class Comment(BaseModel):
    name: str = Field(description="Person who submitted the comment")
    email: Optional[str] = Field(description="Email address of commenter")
    food_item: str = Field(description="Food item the comment is about")
    emotion: Literal["positive", "negative", "uncertain"]

client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    response_model=Comment,
    messages=[{"role": "user", "content": comment}],
    max_retries=3
)
result

The best part is the max_retries bit - if the LLM gives us something that doesn’t fit the schema, Instructor just asks it to fix it!

Contact

Feel free to reach out! You can find me on Twitter or on email at js4571@columbia.edu.