{ "cells": [ { "cell_type": "markdown", "id": "bf43418a", "metadata": {}, "source": [ "# Converting wide data to long data\n", "\n", "There are two major representations of data: one where you have a lot of *columns* (wide data), and another where you have a lot of *rows* (long data). Long data is also known as *tidy* data.\n", "\n", "It's pretty easy to switch between the two once you know what you're doing!\n", "\n", "## Simple example\n", "\n", "This dataset is wide data because beef, ofada and catfish are all different measurements that can be associated with the date." ] }, { "cell_type": "code", "execution_count": 35, "id": "03b8faa5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datebeefofadacafish
02017-01-011001.2377.4899.6
12017-02-011129.0472.3916.5
\n", "
" ], "text/plain": [ " date beef ofada cafish\n", "0 2017-01-01 1001.2 377.4 899.6\n", "1 2017-02-01 1129.0 472.3 916.5" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame([\n", " { 'date': '2017-01-01', 'beef': 1001.2, 'ofada': 377.4, 'cafish': 899.6},\n", " { 'date': '2017-02-01', 'beef': 1129.0, 'ofada': 472.3, 'cafish': 916.5},\n", "])\n", "df" ] }, { "cell_type": "markdown", "id": "364e8219", "metadata": {}, "source": [ "The long version of this dataset would just have rows with a date, a food name, and a price. Beef, ofada and catfish prices would all be on separate rows!\n", "\n", "To transform wide data to long data you use the pandas method `df.melt`. You tell it the columns that should be preserved and it separates everything else out into new rows." ] }, { "cell_type": "code", "execution_count": 9, "id": "b649f95f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datefoodprice
02017-01-01beef1001.2
12017-02-01beef1129.0
22017-01-01ofada377.4
32017-02-01ofada472.3
42017-01-01cafish899.6
52017-02-01cafish916.5
\n", "
" ], "text/plain": [ " date food price\n", "0 2017-01-01 beef 1001.2\n", "1 2017-02-01 beef 1129.0\n", "2 2017-01-01 ofada 377.4\n", "3 2017-02-01 ofada 472.3\n", "4 2017-01-01 cafish 899.6\n", "5 2017-02-01 cafish 916.5" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.melt(id_vars='date', var_name='food', value_name='price')" ] }, { "cell_type": "markdown", "id": "ecd23f00", "metadata": {}, "source": [ "## Converting from wide to long with extra columns\n", "\n", "In the previous example we only re-used `date` in each row. Sometimes you need to keep more than one column!\n", "\n", "With the data below, we want to keep most the `date` and `scale` for each measurement." ] }, { "cell_type": "code", "execution_count": 12, "id": "86fe2e02", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datescalebeefofadacafish
02017-01-01regional1001.2377.4899.6
12017-02-01regional1129.0472.3916.5
\n", "
" ], "text/plain": [ " date scale beef ofada cafish\n", "0 2017-01-01 regional 1001.2 377.4 899.6\n", "1 2017-02-01 regional 1129.0 472.3 916.5" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame([\n", " { 'date': '2017-01-01', 'scale': 'regional', 'beef': 1001.2, 'ofada': 377.4, 'cafish': 899.6},\n", " { 'date': '2017-02-01', 'scale': 'regional', 'beef': 1129.0, 'ofada': 472.3, 'cafish': 916.5},\n", "])\n", "df" ] }, { "cell_type": "markdown", "id": "a37e6736", "metadata": {}, "source": [ "You juts need to pass a list to `id_vars` and you're good to go! Both `date` and `scale` will be carried over to the new rows, while the remaining columns will turn into a single column." ] }, { "cell_type": "code", "execution_count": 13, "id": "805d5a96", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datescalefoodprice
02017-01-01regionalbeef1001.2
12017-02-01regionalbeef1129.0
22017-01-01regionalofada377.4
32017-02-01regionalofada472.3
42017-01-01regionalcafish899.6
52017-02-01regionalcafish916.5
\n", "
" ], "text/plain": [ " date scale food price\n", "0 2017-01-01 regional beef 1001.2\n", "1 2017-02-01 regional beef 1129.0\n", "2 2017-01-01 regional ofada 377.4\n", "3 2017-02-01 regional ofada 472.3\n", "4 2017-01-01 regional cafish 899.6\n", "5 2017-02-01 regional cafish 916.5" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.melt(id_vars=['date', 'scale'], var_name='food', value_name='price')" ] }, { "cell_type": "markdown", "id": "633e2f2f", "metadata": {}, "source": [ "## Converting from long to wide\n", "\n", "Sometimes you want to do the opposite and transform long data to wide data. Let's use the same data as above:" ] }, { "cell_type": "code", "execution_count": 14, "id": "84cf59f9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datefoodprice
02017-01-01beef1001.2
12017-02-01beef1129.0
22017-01-01ofada377.4
32017-02-01ofada472.3
42017-01-01cafish899.6
\n", "
" ], "text/plain": [ " date food price\n", "0 2017-01-01 beef 1001.2\n", "1 2017-02-01 beef 1129.0\n", "2 2017-01-01 ofada 377.4\n", "3 2017-02-01 ofada 472.3\n", "4 2017-01-01 cafish 899.6" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([\n", " { 'date': '2017-01-01', 'food': 'beef', 'price': 1001.2 },\n", " { 'date': '2017-02-01', 'food': 'beef', 'price': 1129.0 },\n", " { 'date': '2017-01-01', 'food': 'ofada', 'price': 377.4 },\n", " { 'date': '2017-02-01', 'food': 'ofada', 'price': 472.3 },\n", " { 'date': '2017-01-01', 'food': 'cafish', 'price': 899.6 },\n", " { 'date': '2017-02-01', 'food': 'cafish', 'price': 916.5 }\n", "])\n", "df.head()" ] }, { "cell_type": "markdown", "id": "0a4c3c3f", "metadata": {}, "source": [ "To convert from long to wide, you'll use `df.pivot_table` on your dataframe." ] }, { "cell_type": "code", "execution_count": 28, "id": "c6b9c608", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodbeefcafishofada
date
2017-01-011001.2899.6377.4
2017-02-011129.0916.5472.3
\n", "
" ], "text/plain": [ "food beef cafish ofada\n", "date \n", "2017-01-01 1001.2 899.6 377.4\n", "2017-02-01 1129.0 916.5 472.3" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_table(index='date', columns='food', values='price')" ] }, { "cell_type": "markdown", "id": "962e5828", "metadata": {}, "source": [ "Right now the `date` column is set up as an index. You can convert it to a normal column with `.reset_index()`." ] }, { "cell_type": "code", "execution_count": 30, "id": "7189fb23", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fooddatebeefcafishofada
02017-01-011001.2899.6377.4
12017-02-011129.0916.5472.3
\n", "
" ], "text/plain": [ "food date beef cafish ofada\n", "0 2017-01-01 1001.2 899.6 377.4\n", "1 2017-02-01 1129.0 916.5 472.3" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_table(index='date', columns='food', values='price').reset_index()" ] }, { "cell_type": "markdown", "id": "e7124ac9", "metadata": {}, "source": [ "## Transforming from long to wide with multiple columns\n", "\n", "Converting from long to wide while preserving multiple columns is practically the same as doing it for one column." ] }, { "cell_type": "code", "execution_count": 20, "id": "c93ebc83", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datescalefoodprice
02017-01-01regionalbeef1001.2
12017-02-01regionalbeef1129.0
22017-01-01regionalofada377.4
32017-02-01regionalofada472.3
42017-01-01regionalcafish899.6
52017-02-01regionalcafish916.5
\n", "
" ], "text/plain": [ " date scale food price\n", "0 2017-01-01 regional beef 1001.2\n", "1 2017-02-01 regional beef 1129.0\n", "2 2017-01-01 regional ofada 377.4\n", "3 2017-02-01 regional ofada 472.3\n", "4 2017-01-01 regional cafish 899.6\n", "5 2017-02-01 regional cafish 916.5" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([\n", " {'date': '2017-01-01', 'scale': 'regional', 'food': 'beef', 'price': 1001.2 },\n", " {'date': '2017-02-01', 'scale': 'regional', 'food': 'beef', 'price': 1129.0 },\n", " {'date': '2017-01-01', 'scale': 'regional', 'food': 'ofada', 'price': 377.4 },\n", " {'date': '2017-02-01', 'scale': 'regional', 'food': 'ofada', 'price': 472.3 },\n", " {'date': '2017-01-01', 'scale': 'regional', 'food': 'cafish', 'price': 899.6 },\n", " {'date': '2017-02-01', 'scale': 'regional', 'food': 'cafish', 'price': 916.5 },\n", "])\n", "df" ] }, { "cell_type": "markdown", "id": "e5f4b238", "metadata": {}, "source": [ "The only difference is when using `pivot_table` you pass *two* columns instead of one." ] }, { "cell_type": "code", "execution_count": 32, "id": "7ad7af63", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
foodbeefcafishofada
datescale
2017-01-01regional1001.2899.6377.4
2017-02-01regional1129.0916.5472.3
\n", "
" ], "text/plain": [ "food beef cafish ofada\n", "date scale \n", "2017-01-01 regional 1001.2 899.6 377.4\n", "2017-02-01 regional 1129.0 916.5 472.3" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_table(index=['date', 'scale'], columns='food', values='price')" ] }, { "cell_type": "markdown", "id": "60c05d1d", "metadata": {}, "source": [ "To turn the axes - the bold parts - into normal columns, all you need to do is `reset_index()`." ] }, { "cell_type": "code", "execution_count": 34, "id": "f132fb73", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fooddatescalebeefcafishofada
02017-01-01regional1001.2899.6377.4
12017-02-01regional1129.0916.5472.3
\n", "
" ], "text/plain": [ "food date scale beef cafish ofada\n", "0 2017-01-01 regional 1001.2 899.6 377.4\n", "1 2017-02-01 regional 1129.0 916.5 472.3" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot_table(index=['date', 'scale'], columns='food', values='price').reset_index()" ] }, { "cell_type": "code", "execution_count": null, "id": "f8e6cd2a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }