Find this page at bit.ly/ds-dojo-2024
データサイエンティストDOJO 2024
Hi, I’m Jonathan Soma! This page will host all of the content for Data Scientist Dojo 2024.
Monday
Tuesday
Wednesday
Thursday
Friday
Troubleshooting and project work time
Monday
- Project presentations
- Data- and AI-driven projects (slides)
The best tool for working with PDFs is pdfplumber. There are plenty of videos on YouTube about how to use it (although I haven’t watched them).
You can also use this interactive demo to test out the cropping/table selection.
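As a rough starting point, here is a minimal sketch of cropping a page and pulling out a table with pdfplumber. The function names, the page number, and the bounding box values are placeholders I made up for illustration; you would find the real crop coordinates for your own PDF by experimenting.

```python
def rows_to_records(rows):
    """Turn a table (first row is the header) into a list of dicts."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def extract_cropped_table(pdf_path, bbox, page_number=0):
    """Crop one page to bbox = (x0, top, x1, bottom) and read the table inside it."""
    # Third-party library: pip install pdfplumber
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        cropped = page.crop(bbox)
        # extract_table() returns a list of rows, or None if no table is found
        table = cropped.extract_table()
    return rows_to_records(table)
```

You might call it like `extract_cropped_table("report.pdf", bbox=(0, 100, 600, 400))`, where `report.pdf` and the coordinates are hypothetical. Cropping first tends to help when a page has headers, footers, or several tables that confuse the table detection.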
There is also a tutorial for an automatic scraper, although you’ll need to change a few things to make it work with Playwright. You’ll also want to make your repository private if the data isn’t something you’d like to make public! I recommend following the tutorial first to learn how it works, then doing it separately with your “real” data.
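The scraping script itself might look something like the sketch below when written with Playwright. This is only an assumption about what your scraper could do: the function names and the `h1, h2` selector are invented for illustration, and the scheduling part is whatever the tutorial sets up. Playwright needs `pip install playwright` followed by `playwright install chromium` before it will run.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Format a list of rows as CSV text, ready to be written to a file."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def scrape_headings(url):
    """Open a page in headless Chromium and return its headings, one per row."""
    # Third-party library, imported here so the CSV helper works without it
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Placeholder selector: swap in whatever matches your data
        headings = page.locator("h1, h2").all_inner_texts()
        browser.close()
    return [[text] for text in headings]
```

A scheduled run could then save `rows_to_csv_text(scrape_headings(url))` to a file in the repository, with `url` pointing at the page you want to track.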
If you’re interested in traditional analysis, you might want to try out investigate.ai, a website I made with a series of tutorials about data science for non-data-science people (journalists). It’s from before ChatGPT, so it might not be the best answer these days, but it can still give you some ideas about things like regression and text analysis.
About the instructor
- Jonathan Soma, Knight Chair in Data Journalism, Columbia Graduate School of Journalism
- js4571@columbia.edu
- @dangerscarf