```python
import csv

input = open("people.csv", "r")
people = csv.DictReader(input)
```
The most common way for anyone to share data is through files. Because you are using a computer, we can safely assume that you have already worked with files, but there is a lot more to learn about them if you want to become proficient in software and to be able to use and create files programmatically.
The most likely file formats you will encounter
Python can be used to work with a wide variety of file formats, but the most common ones when you are working with data are `csv`, `xls`, and possibly some `json`.
There are many other file formats that you may encounter and work with, but if you get going with these three, then with enough practice you will be able to handle most of the data you come across!
First, some basics
Before we start working with files, let’s first understand some basic concepts:
- File: A file is a collection of data stored on a disk with a specific name and a specific format.
- File Format: A file format is a standard way that information is encoded for storage in a computer file.
- File Path: A file path is the human-readable representation of the location of a file in a computer.
- File Extension: A file extension is a suffix at the end of a filename that indicates what type of file it is.
If we take an example, the file path `C:\Users\JohnDoe\Documents\example.txt` points to a file named `example.txt` located in the `Documents` folder of the `JohnDoe` user, inside the `Users` folder of the `C:` drive. If you were using a Mac, the file path would look like `/Users/JohnDoe/Documents/example.txt`, and on a Linux machine it would look like `/home/JohnDoe/Documents/example.txt`.
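In Python you rarely need to take paths apart by hand: the standard library's `pathlib` module can do it for you. A minimal sketch (the path is the illustrative example above, inspected as a Unix-style path so it works the same on any operating system):

```python
from pathlib import PurePosixPath

# PurePosixPath lets us inspect a Unix-style path without touching the disk.
path = PurePosixPath("/Users/JohnDoe/Documents/example.txt")

print(path.name)    # example.txt
print(path.suffix)  # .txt
print(path.parent)  # /Users/JohnDoe/Documents
```

For paths on the machine you are actually running on, `pathlib.Path` works the same way and can also open and check files.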
The file extension is the `.txt` part of the file name, and it indicates that the file is a text file: a file that contains only text, which can be opened and read by any text editor. However, a text file does not have to have a `.txt` extension. For example, a `.csv` file is also a text file, but it has a `.csv` extension because it contains comma-separated values (`csv` stands for "Comma-Separated Values").
On the other hand, a `.xls` file is not a text file; it is a binary file that can only be opened and read by a program that understands the Excel file format, such as Microsoft Excel or Apple's Numbers.
Text files such as `.txt` and `.csv` also have something called an encoding: a standard way to represent text as bytes in a computer file. The most common encoding is `utf-8`, but there are many others, such as `ascii`, `latin-1`, and `utf-16`.
When you open a text file in a text editor, the text editor will automatically detect the encoding of the file and display the text correctly. However, when you open a text file in a program that doesn’t understand the encoding of the file, the text may be displayed incorrectly. If you encounter this problem (e.g., if you open a text file in Python and the text is displayed incorrectly), you may need to specify the encoding of the file when you open it. This should not happen often, but it is something to keep in mind.
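For example, a file written in one encoding can come out garbled when read back assuming another; passing the correct `encoding` to `open` fixes it. A small sketch (the file name and contents are made up for illustration):

```python
# Write a file containing a non-ASCII character, using the latin-1 encoding.
with open("cafe.txt", "w", encoding="latin-1") as f:
    f.write("café")

# Reading it back assuming utf-8 would garble (or fail on) the "é";
# specifying the encoding explicitly recovers the text correctly.
with open("cafe.txt", "r", encoding="latin-1") as f:
    text = f.read()

print(text)  # café
```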
Working with CSV files
A CSV file is a text file that contains comma-separated values. Each line in a CSV file represents a row of data, and the values in each row are separated by commas. For example, the following is a CSV file that contains information about some people:
```
name,age,gender,nationality
John Doe,30,male,Great Britain
Jane Smith,27,female,New Zealand
Markus Müller,35,male,Germany
```
You will notice that the first line of the file contains the names of the columns, and the subsequent lines contain the values of the columns. This is a common convention in CSV files, but it is not required. Some CSV files may not have column names, and some may have different delimiters (e.g., semicolons instead of commas).
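If you run into such a file, the `csv` module lets you specify the delimiter yourself. A small sketch, with a made-up semicolon-separated file created inline so the example is self-contained:

```python
import csv

# Create a tiny semicolon-separated file for the sake of the example.
with open("people-semicolon.csv", "w") as f:
    f.write("name;age\nJohn Doe;30\n")

# delimiter=";" tells DictReader to split on semicolons instead of commas.
with open("people-semicolon.csv", "r") as f:
    rows = list(csv.DictReader(f, delimiter=";"))

print(rows)  # [{'name': 'John Doe', 'age': '30'}]
```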
To work with CSV files in Python, you can use the `csv` module (a built-in Python module), which provides functions for reading and writing CSV files. The snippet at the very top of this chapter shows how you can read a CSV file in Python.
That is it! You loaded a CSV file into Python! You can now work with the data in the file as you would with any other data in Python, for example you can check which columns are in the file, you can filter the data, you can calculate statistics, etc.
In the above code, we used the `DictReader` class of the `csv` module to read the CSV file, which we opened with `open('people.csv', 'r')` (the `r` just indicates we are opening the file for reading only). `DictReader` returns an iterator that yields a dictionary for each row of the file. The keys of the dictionary are the column names, and the values are the values of the columns.
You can check the column names of the CSV file through the `fieldnames` attribute of the `DictReader` object:
```python
print(people.fieldnames)
```

```
['name', 'age', 'gender', 'nationality']
```
We can then iterate over the rows of the CSV file and print each row.
```python
for person in people:
    print(person)
```

```
{'name': 'John Doe', 'age': '30', 'gender': 'male', 'nationality': 'Great Britain'}
{'name': 'Jane Smith', 'age': '27', 'gender': 'female', 'nationality': 'New Zealand'}
{'name': 'Markus Müller', 'age': '35', 'gender': 'male', 'nationality': 'Germany'}
```
Keep in mind that the rows are represented as dictionaries, so you can access the values of the columns by using the column names as keys.
```python
for person in people:
    print(person["name"], person["age"])
```
Wait… why did the code above print nothing?!
The reason is that the `DictReader` object is an iterator, and iterators in Python are consumed as you iterate over them. This means that once you have iterated over the `DictReader` object, you cannot iterate over it again.
To iterate over the rows multiple times, you need to read the file again into a new `DictReader` object. When reading a CSV file using the `csv` module, it is therefore common to read the file into a list of dictionaries, so that you can iterate over the list multiple times.
```python
# Reset back to the beginning of the file
input.seek(0)

# Read the file again, this time as a list of dictionaries
people = list(csv.DictReader(input))

# Iterate over the list of dictionaries...
for person in people:
    print(person)

# Iterate over the list of dictionaries again...
for person in people:
    print(person["name"], person["age"])
```

```
{'name': 'John Doe', 'age': '30', 'gender': 'male', 'nationality': 'Great Britain'}
{'name': 'Jane Smith', 'age': '27', 'gender': 'female', 'nationality': 'New Zealand'}
{'name': 'Markus Müller', 'age': '35', 'gender': 'male', 'nationality': 'Germany'}
John Doe 30
Jane Smith 27
Markus Müller 35
```
That’s better. And you’ve learned something new about iterators! In the example above we also reset the file pointer to the beginning of the file using the `seek(0)` method of the file object (`0` means the beginning of the file). This is necessary because the file pointer is at the end of the file after the first read, and we need to move it back to the beginning before reading the file again.
Writing to a CSV file
Just as you can read a CSV file using the `csv` module, you can also write to one. As an example, let us update our people list and write the updated list to a new CSV file.
```python
# Change the age of "Jane Smith" to 26
for person in people:
    if person["name"] == "Jane Smith":
        person["age"] = 26

for person in people:
    print(person)
```

```
{'name': 'John Doe', 'age': '30', 'gender': 'male', 'nationality': 'Great Britain'}
{'name': 'Jane Smith', 'age': 26, 'gender': 'female', 'nationality': 'New Zealand'}
{'name': 'Markus Müller', 'age': '35', 'gender': 'male', 'nationality': 'Germany'}
```
That worked well! Let us now take the `people` list and write it to a new CSV file.
```python
# Write the updated data back to a new file
output = open("people-updated.csv", "w")
writer = csv.DictWriter(output, fieldnames=["name", "age", "gender", "nationality"])

writer.writeheader()
for person in people:
    writer.writerow(person)

output.close()
```
In the above code, we used the `DictWriter` class of the `csv` module to write the `people` list to a new CSV file named `people-updated.csv`. We opened the file with `open('people-updated.csv', 'w')` (the `w` indicates we are opening the file for writing only), and we passed the column names to the `fieldnames` argument of the `DictWriter` object. We then wrote the column names to the file using the `writeheader` method of the `DictWriter` object, and we wrote the rows of the `people` list to the file using the `writerow` method. Finally, we closed the file using the `close` method of the file object.
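By the way, instead of calling `close` yourself, Python's `with` statement closes the file automatically when the block ends, even if an error occurs inside it. The same writing step could be sketched like this (using a shortened, made-up one-row sample of the data so the snippet is self-contained):

```python
import csv

# A shortened stand-in for the people list used above.
people_sample = [
    {"name": "John Doe", "age": "30", "gender": "male", "nationality": "Great Britain"},
]

# The with statement closes the file automatically at the end of the block.
# newline="" is the recommended setting when writing CSV files,
# to avoid extra blank lines on some platforms.
with open("people-updated.csv", "w", newline="") as output:
    writer = csv.DictWriter(output, fieldnames=["name", "age", "gender", "nationality"])
    writer.writeheader()
    writer.writerows(people_sample)
```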
Pandas and files
Pandas is a powerful data manipulation library for Python that provides data structures and functions for working with structured data. One of the main features of Pandas is its ability to read and write data from and to a wide variety of file formats, including CSV, Excel, JSON, SQL, and many others.
While in the previous section we learned how to read and write CSV files using the `csv` module, in this section we will learn how to do the same using Pandas. It provides a much simpler and more powerful interface for working with files that hold the kinds of data you are likely to encounter, and it will make your life much easier when working with data!
This doesn’t mean you should forget about the lower-level ways of working with files, but it is good to know that you have this option available, as it will probably be the most common way you work with files in the future.
What is Pandas and what is it for?
Pandas is the Swiss Army knife of data manipulation in Python. Together with NumPy (a library for numerical computing in Python), it is the most widely used library for data manipulation in Python. It provides data structures and functions for working with structured data, and it is used extensively in data science, machine learning, and other fields where data analysis is required.
The main data structure in Pandas is the `DataFrame`, a two-dimensional table of data with rows and columns. A `DataFrame` is similar to a spreadsheet in Excel or a table in a database, and it provides a powerful interface for working with structured data. You can think of a `DataFrame` as a collection of `Series` objects, where each `Series` object represents a column of the `DataFrame`.
In data science, a `DataFrame` is the most common way to represent structured data, and it is used in many libraries and tools for data analysis, machine learning, and other tasks. If you are working with structured data in Python, you will likely be using `DataFrame` objects to represent it. Besides `DataFrame`, Pandas also provides a `Series` object, which is a one-dimensional array of data with an index.
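To make the idea of a `Series` concrete, here is a tiny sketch of one with an explicit index of labels (the names and ages are taken from the people example in this chapter):

```python
import pandas as pd

# A Series is a one-dimensional array of values, each paired with an index label.
ages = pd.Series([30, 27, 35], index=["John", "Jane", "Markus"])

print(ages["Jane"])  # 27
print(ages.mean())   # 30.666666666666668
```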
Anaconda already includes Pandas, so you shouldn’t need to install it; you can start using it in your Jupyter Notebooks right away. As an example, let us create two `Series` of monthly sales data and combine them into a `DataFrame`.
```python
import pandas as pd

# Create a months series and a sales series
months = pd.Series(["January", "February", "March", "April"])
sales = pd.Series([180391, 197156, 193501, 199468])

sales_dataframe = pd.DataFrame({"Month": months, "Sales": sales})
sales_dataframe
```
|   | Month | Sales |
|---|---|---|
| 0 | January | 180391 |
| 1 | February | 197156 |
| 2 | March | 193501 |
| 3 | April | 199468 |
Reading files with Pandas
Pandas provides abstractions which make file handling much easier. For example, to read a CSV file into a `DataFrame`, you can use the `read_csv` function of Pandas. This function reads a CSV file and returns a `DataFrame` object that represents the data in the file. Let’s load the `people.csv` file into a `DataFrame`.
```python
people = pd.read_csv("people.csv")
people
```
|   | name | age | gender | nationality |
|---|---|---|---|---|
| 0 | John Doe | 30 | male | Great Britain |
| 1 | Jane Smith | 27 | female | New Zealand |
| 2 | Markus Müller | 35 | male | Germany |
That was super easy! A dataframe is a much more powerful way to work with data than a list of dictionaries, as it provides many more functions and methods to work with the data. For example, you can filter the data, you can calculate statistics, you can group the data, you can join the data with other data, etc.
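As a quick illustration of filtering, a boolean condition per row selects only the matching rows. A small sketch (the table is rebuilt inline from the people data above so the snippet is self-contained):

```python
import pandas as pd

# Rebuild the small people table inline.
people = pd.DataFrame(
    {
        "name": ["John Doe", "Jane Smith", "Markus Müller"],
        "age": [30, 27, 35],
    }
)

# people["age"] > 28 produces True/False per row; indexing with it keeps the True rows.
over_28 = people[people["age"] > 28]
print(over_28["name"].tolist())  # ['John Doe', 'Markus Müller']
```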
For example, let us calculate the average age of the people in the `people` dataframe.
```python
average_age = people["age"].mean()
print(average_age)
```

```
30.666666666666668
```
Updating the `people` dataframe is also very easy. For example, let us update the age of one of the people in the `people` dataframe and write the updated dataframe to a new CSV file like before. First, we will update the dataframe. To do so, we will use the `loc` indexer of the `DataFrame` object; `loc` is used to access a group of rows and columns by labels.
```python
people.loc[people["name"] == "Jane Smith", "age"] = 26
```
The above code looks through the table of people, finds every entry where the person’s name is “Jane Smith”, and changes their age to 26. Much easier than working with a `for` loop like we did before.
We could also add a new row to the `people` dataframe. To do so, we will use the `concat` function of Pandas.
```python
new_people = pd.DataFrame(
    [
        {
            "name": "Florentino das Rosas",
            "age": 51,
            "gender": "male",
            "nationality": "Portugal",
        }
    ]
)

people = pd.concat([people, new_people], ignore_index=True)
people
```
|   | name | age | gender | nationality |
|---|---|---|---|---|
| 0 | John Doe | 30 | male | Great Britain |
| 1 | Jane Smith | 26 | female | New Zealand |
| 2 | Markus Müller | 35 | male | Germany |
| 3 | Florentino das Rosas | 51 | male | Portugal |
The way you add new rows to a dataframe is by concatenating an existing dataframe with a new dataframe that contains the new rows. In the above code, we created a new dataframe called `new_people` that contains a single new row with the name “Florentino das Rosas”, age 51. We then concatenated the `people` dataframe with the `new_people` dataframe using the `concat` function of Pandas, and we assigned the result to `people` again.
`ignore_index=True` is used to ignore the index of the dataframe you are adding (which would be `0`) and create a new sequential index for the concatenated dataframe. If you don’t use `ignore_index=True`, the index of the `new_people` dataframe is kept, so the concatenated dataframe ends up with a duplicate index label.
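To see the difference, here is a small sketch concatenating two one-row dataframes with and without `ignore_index` (the names are made up for illustration):

```python
import pandas as pd

a = pd.DataFrame([{"name": "John"}])
b = pd.DataFrame([{"name": "Jane"}])

# Without ignore_index, both rows keep their original index label 0.
kept = pd.concat([a, b])
print(kept.index.tolist())  # [0, 0]

# With ignore_index=True, a fresh sequential index is created.
fresh = pd.concat([a, b], ignore_index=True)
print(fresh.index.tolist())  # [0, 1]
```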
Let us now write the updated `people` dataframe to a new CSV file (`index=False` tells Pandas not to write the row index as an extra column).

```python
people.to_csv("people-updated.csv", index=False)
```
As you can see, this is much more concise and easier to work with than the `csv` module. Later on, we will dive deeper into Pandas and learn more about its capabilities, but for now, this should be enough to get you started with working with files in Python.
Exercises
- Write a Python program that reads a CSV file containing information about mine exploration (with the columns `mine_name`, `location`, `tonnes_extracted`, `ore_grade`), and which calculates the total amount of ore extracted from all mines in the file. Use Pandas if you prefer.
- Add a new row to the CSV file with the information of a new mine, and write the updated data to a new CSV file.