Working with Files – Pedro Leitão

The most common way for anyone to share data is through files. Because you are using a computer, one can safelly assume that you have already worked with files, but there’s a lot more to learn about them if you want to become proficient in software and if you want to be able to programmatically use and create files.

The most likely file formats you will encounter

Python can be used to work with a wide variety of file formats, but the most common ones if you are working with data are csv, xls and possibly some json.

There are many other files formats that you can encounter and work with, but if you get going with these three, you will be able to work with most of the data you will encounter with enough practice!

First, some basics

Before we start working with files, let’s first understand some basic concepts:

File: A file is a collection of data stored in a disk with a specific name and a specific format.
File Format: A file format is a standard way that information is encoded for storage in a computer file.
File Path: A file path is the human-readable representation of the location of a file in a computer.
File Extension: A file extension is a suffix at the end of a filename that indicates what type of file it is.

If we take an example, the file path C:\Users\JohnDoe\Documents\example.txt is a file path that points to a file named example.txt that is located in the Documents folder of the JohnDoe user in the Users folder of the C: drive. If you were using a Mac, the file path would look like /Users/JohnDoe/Documents/example.txt, and if you were using a Linux machine, it would look like /home/JohnDoe/Documents/example.txt.

The file extension is the .txt part of the file name, and it indicates that the file is a text file. A text file is a file that contains only text, and it can be opened and read by any text editor.

Any file which can be opened and read by a text editor is called a text file. However, a text file doesn’t have to have a .txt extension. For example, a .csv file is also a text file, but it has a .csv extension because it is a file that contains comma-separated values (csv stands for “Comma Separated Values”).

On the other hand, a .xls file is not a text file, it is a binary file that can only be opened and read by a program that understands the Excel file format, such as Microsoft Excel or Apple’s Numbers.

Text files such as .txt and .csv also have something called an encoding. An encoding is a way to represent text in a computer file. The most common encoding is utf-8, which is a standard way to represent text in a computer file. However, there are many other encodings, such as ascii, latin-1, utf-16, etc.

When you open a text file in a text editor, the text editor will automatically detect the encoding of the file and display the text correctly. However, when you open a text file in a program that doesn’t understand the encoding of the file, the text may be displayed incorrectly. If you encounter this problem (e.g., if you open a text file in Python and the text is displayed incorrectly), you may need to specify the encoding of the file when you open it. This should not happen often, but it is something to keep in mind.

Working with CSV files

A CSV file is a text file that contains comma-separated values. Each line in a CSV file represents a row of data, and the values in each row are separated by commas. For example, the following is a CSV file that contains information about some people:

name,age,gender,nationality
John Doe,30,male,Great Britain
Jane Smith,27,female,New Zealand
Markus Müller,35,male,Germany

You will notice that the first line of the file contains the names of the columns, and the subsequent lines contain the values of the columns. This is a common convention in CSV files, but it is not required. Some CSV files may not have column names, and some may have different delimiters (e.g., semicolons instead of commas).

To work with CSV files in Python, you can use the csv module (this is a builtin Python module), which provides functions for reading and writing CSV files. Here is an example of how you can read a CSV file in Python.

import csv

input = open("people.csv", "r")
people = csv.DictReader(input)

That is it! You loaded a CSV file into Python! You can now work with the data in the file as you would with any other data in Python, for example you can check which columns are in the file, you can filter the data, you can calculate statistics, etc.

In the above code, we used the DictReader method of the csv module to read the CSV file which we opened with open('people.csv', 'r') (the r just indicates we are opening the file for reading only). This method reads the CSV file and returns an iterator that yields a dictionary for each row of the file. The keys of the dictionary are the column names, and the values are the values of the columns.

You can check the column names of the CSV file by calling the fieldnames attribute of the DictReader object:

print(people.fieldnames)

['name', 'age', 'gender', 'nationality']

We can then iterate over the rows of the CSV file and print each row.

for person in people:
    print(person)

{'name': 'John Doe', 'age': '30', 'gender': 'male', 'nationality': 'Great Britain'}
{'name': 'Jane Smith', 'age': '27', 'gender': 'female', 'nationality': 'New Zealand'}
{'name': 'Markus Müller', 'age': '35', 'gender': 'male', 'nationality': 'Germany'}

Keep in mind that the rows are represented as dictionaries, so you can access the values of the columns by using the column names as keys.

for person in people:
    print(person["name"], person["age"])

Wait… why did the code above print nothing ?!?

The reason is that the DictReader object is an iterator, and iterators in Python are consumed when you iterate over them. This means that once you have iterated over the DictReader object, you cannot iterate over it again.

To iterate over the DictReader object multiple times, you need to read the file again into a DictReader object. When reading a CSV file using the csv module, it is common to read the file into a list of dictionaries, so that you can iterate over the list multiple times.

# Reset back to the beginning of the file
input.seek(0)

# Read the file again, this time as a list of dictionaries
people = list(csv.DictReader(input))

# Iterate over the list of dictionaries...
for person in people:
    print(person)

# Iterate over the list of dictionaries again...
for person in people:
    print(person["name"], person["age"])

{'name': 'John Doe', 'age': '30', 'gender': 'male', 'nationality': 'Great Britain'}
{'name': 'Jane Smith', 'age': '27', 'gender': 'female', 'nationality': 'New Zealand'}
{'name': 'Markus Müller', 'age': '35', 'gender': 'male', 'nationality': 'Germany'}
John Doe 30
Jane Smith 27
Markus Müller 35

That’s better. And you’ve learned something new about iterators! In the example above we also reset the file pointer to the beginning of the file using the seek(0) method of the file object (0 means beginning of the file). This is necessary because the file pointer is at the end of the file after reading the file, and we need to move it back to the beginning of the file to read the file again.

Writing to a CSV file

Just as you can read a CSV file using the csv module, you can also write to it. As an example, let us update our people list and write the updated list to a new CSV file.

# Change the age of "Jane Smith" to 26
for person in people:
    if person["name"] == "Jane Smith":
        person["age"] = 26

for person in people:
    print(person)

{'name': 'John Doe', 'age': '30', 'gender': 'male', 'nationality': 'Great Britain'}
{'name': 'Jane Smith', 'age': 26, 'gender': 'female', 'nationality': 'New Zealand'}
{'name': 'Markus Müller', 'age': '35', 'gender': 'male', 'nationality': 'Germany'}

That worked well! Let us now take the people list and write it to a new CSV file.

# Write the updated data back to a new file
output = open("people-updated.csv", "w")

writer = csv.DictWriter(output, fieldnames=["name", "age", "gender", "nationality"])
writer.writeheader()

for person in people:
    writer.writerow(person)

output.close()

In the above code, we used the DictWriter method of the csv module to write the people list to a new CSV file named people_updated.csv. We opened the file with open('people_updated.csv', 'w') (the w indicates we are opening the file for writing only), and we passed the column names to the fieldnames argument of the DictWriter object. We then wrote the column names to the file using the writeheader method of the DictWriter object, and we wrote the rows of the people list to the file using the writerow method of the DictWriter object. Finally, we closed the file using the close method of the file object.

Pandas, and files

Pandas is a powerful data manipulation library for Python that provides data structures and functions for working with structured data. One of the main features of Pandas is its ability to read and write data from and to a wide variety of file formats, including CSV, Excel, JSON, SQL, and many others.

While in the previous section we learned how to read and write CSV files using the csv module, in this section we will learn how to read and write CSV files using Pandas. It provides a much simpler and more powerful interface for working with files which hold the types of data you will likely encounter, and it will make your life much easier when working with data!

This doesn’t mean you should forget about the more lower level ways of working with files, but it is good to know that you have this option available to you, as it will probably be the most common way you will work with files in the future.

What is Pandas and what is it for ?

Pandas literally is the swiss army knife of data manipulation in Python. Together with Numpy (which is a library for numerical computing in Python), it is the most used library for data manipulation in Python. It provides data structures and functions for working with structured data, and it is widely used in data science, machine learning, and other fields where data analysis is required.

The main data structure in Pandas is the DataFrame, which is a two-dimensional table of data with rows and columns. A DataFrame is similar to a spreadsheet in Excel or a table in a database, and it provides a powerful interface for working with structured data. You can think of a DataFrame as a collection of Series objects, where each Series object represents a column of the DataFrame.

About DataFrames and Series

In data science, a DataFrame is the most common way to represent structured data, and it is used in many libraries and tools for data analysis, machine learning, and other tasks. If you are working with structured data in Python, you will likely be using DataFrame objects to represent the data. Besides DataFrame, Pandas also provides a Series object, which is a one-dimensional array of data with an index.

Anaconda already includes Pandas, so you shouldn’t need to install it. You can right away start using it in your Jupyter Notebooks. Let us create the series and the dataframe from the sales dataframe above as an example.

import pandas as pd

# Create a months series
months = pd.Series(["January", "February", "March", "April"])
sales = pd.Series([180391, 197156, 193501, 199468])
sales_dataframe = pd.DataFrame({"Month": months, "Sales": sales})

sales_dataframe

	Month	Sales
0	January	180391
1	February	197156
2	March	193501
3	April	199468

Reading files with Pandas

Pandas provides abstractions which make file handling much easier. For example, to read a CSV file into a DataFrame, you can use the read_csv function of Pandas. This function reads a CSV file and returns a DataFrame object that represents the data in the file. Let’s load the people.csv file into a DataFrame.

people = pd.read_csv("people.csv")

people

	name	age	gender	nationality
0	John Doe	30	male	Great Britain
1	Jane Smith	27	female	New Zealand
2	Markus Müller	35	male	Germany

That was super easy! A dataframe is a much more powerful way to work with data than a list of dictionaries, as it provides many more functions and methods to work with the data. For example, you can filter the data, you can calculate statistics, you can group the data, you can join the data with other data, etc.

For example, let us calculate the average age of the people in the people dataframe.

average_age = people["age"].mean()

print(average_age)

30.666666666666668

Updating the people dataframe is also very easy. For example, let us update the age of the people in the people dataframe and write the updated dataframe to a new CSV file like before. First, we will update the dataframe. To do so, we will use the loc method of the DataFrame object - loc is used to access a group of rows and columns by labels.

people.loc[people["name"] == "Jane Smith", "age"] = 26

The above code looks through a table of people, finds every entry where the person’s name is “Jane Smith,” and changes their age to 26. Much easier than working with a for loop like we did before.

We could also add a new row to the people dataframe. To do so, we will use the concat method of the DataFrame object.

new_people = pd.DataFrame(
    [
        {
            "name": "Florentino das Rosas",
            "age": 51,
            "gender": "male",
            "nationality": "Portugal",
        }
    ]
)

people = pd.concat([people, new_people], ignore_index=True)

people

	name	age	gender	nationality
0	John Doe	30	male	Great Britain
1	Jane Smith	26	female	New Zealand
2	Markus Müller	35	male	Germany
3	Florentino das Rosas	51	male	Portugal

The way you add new rows to a dataframe is by concatenating an existing dataframe with a new dataframe that contains the new rows. In the above code, we created a new dataframe called new_people that contains a new row with the name “Florentino das Rosas,” age 51. We then concatenated the people dataframe with the new_people dataframe using the concat method of the DataFrame object, and we assigned the result to people again.

ignore_index=True is used to ignore the index of the new dataframe you are adding (which will be 0) and create a new sequential index for the concatenated dataframe. If you don’t use ignore_index=True, the index of the new new_people dataframe will be used as the index of the concatenated dataframe.

Let us now write the updated people dataframe to a new CSV file.

people.to_csv("people-updated.csv", index=False)

As you can see, much more concise and easier to work with than the csv module. Later on, we will dive deeper into Pandas and learn more about its capabilities, but for now, this should be enough to get you started with working with files in Python.

Exercises

Write a Python program that reads a CSV file containing information about mine exploration (with the columns mine_name, location, tonnes_extracted, ore_grade), and which calculates the total amount of ore extracted from all mines in the file. Use Pandas if you prefer.
Add a new row to the CSV file with the information of a new mine, and write the updated data to a new CSV file.

Reuse

This work is licensed under CC BY (View License)