There are several common errors when learning Python pandas basics and working with the pandas read_csv function. Here are some of the most common ones:
Incorrect delimiter
By default, the read_csv function assumes that the delimiter is a comma. If your CSV file uses a different delimiter, such as a tab or semicolon, you need to specify it using the sep (or delimiter) parameter.
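For example, a minimal sketch for a semicolon-delimited file (the file path is a placeholder):
import pandas as pd
# semicolon-delimited file: pass the delimiter explicitly
df = pd.read_csv('path/to/file.csv', sep=';')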
pandas.errors.EmptyDataError: No columns to parse from file
When using read_csv in Pandas, you may encounter this error:
File "pandas\_libs\parsers.pyx", line 555, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
The problem is usually the separator:
df = pd.read_csv(file_path)
By default, the separator in read_csv is a comma ','.
Solution to the error
So if we have a tab-delimited CSV, we need to specify that we are using tabs as the separator: sep="\t"
df = pd.read_csv(file_path, sep="\t")
Possible separators in CSV files:
- Tab: sep = "\t"
- Semicolon: sep = ";"
- Colon: sep = ":"
- Space: sep = " "
- Pipe: sep = "|"
- Tilde: sep = "~"
Another solution is to use the Python engine (engine="python") together with sep=None to detect the separator automatically. However, the Python engine is slower, and in my experience automatic detection does not always work properly, for example when you want to combine two different files.
df = pd.read_csv(file_path, sep=None, engine="python")
sep : str, default ','
If you don't specify a delimiter using the sep parameter, the C engine won't be able to detect it automatically. However, the Python parsing engine can, using the csv.Sniffer tool that comes built in with Python. So, if sep is set to None, the Python parsing engine will be used instead. Remember that separators longer than one character and different from '\s+' are treated as regular expressions, which also requires the Python parsing engine. It's important to note that regex delimiters can cause quoted data to be ignored; for example, the regex '\r\t' may cause issues when processing data enclosed in quotes.
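As a sketch, a regex separator (here a semicolon optionally surrounded by whitespace, on a hypothetical file) requires the Python engine:
import pandas as pd
# multi-character/regex separator: treated as a regular expression, Python engine only
df = pd.read_csv('path/to/file.csv', sep=r'\s*;\s*', engine='python')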
engine : {'c', 'python', 'pyarrow'}, optional
The C and pyarrow engines are known for their faster performance, while the Python engine is currently more feature-rich. If you require multithreading support, the pyarrow engine is presently the only one that provides it.
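For illustration, choosing an engine explicitly (the pyarrow engine needs the pyarrow package and a reasonably recent pandas version):
import pandas as pd
# default fast C engine
df = pd.read_csv('path/to/file.csv', engine='c')
# multithreaded reader (requires the pyarrow package)
df = pd.read_csv('path/to/file.csv', engine='pyarrow')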
Encoding errors
If your CSV file contains non-ASCII characters or uses a non-standard encoding, you may encounter encoding errors. You can specify the encoding using the encoding parameter. If you're not sure what encoding to use, try 'utf-8' or 'latin-1'.
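A minimal sketch of that fallback (the file path is a placeholder):
import pandas as pd
file_path = 'path/to/file.csv'
try:
    df = pd.read_csv(file_path, encoding='utf-8')
except UnicodeDecodeError:
    # latin-1 maps every possible byte, so this read won't raise a decode error
    df = pd.read_csv(file_path, encoding='latin-1')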
'utf-8' codec can't decode byte 0xff in position 0
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
This error can be caused by the encoding. One reason pandas read_csv can't decode a byte is that the CSV file has UTF-16 encoding. For example, when I manually downloaded Facebook Leads into a CSV file, they were in UTF-16 format.
The default encoding in read_csv in pandas is utf-8 (encoding='utf-8'), so if the file uses a different one we need to specify the right encoding:
df = pd.read_csv(file_path, encoding='utf-16')
Check encoding of CSV files with Python
The script below checks the encoding of all CSV files in a folder and prints the encoding of each file.
pip install chardet
import os
import chardet
# specify the folder path containing the CSV files
folder_path = 'path/to/csv/folder'
# iterate through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):  # check if the file is a CSV file
        # open the file in binary mode and read the first 1000 bytes to detect the encoding
        with open(os.path.join(folder_path, filename), 'rb') as file:
            result = chardet.detect(file.read(1000))
        # print the filename and encoding
        print(f"{filename}: {result['encoding']}")
This code reads each CSV file in the specified folder in binary mode using open() with rb mode, reads the first 1000 bytes of the file to detect the encoding using chardet.detect(), and then prints the filename and encoding.
Change encoding of CSV file with Python
import csv
import codecs
# specify the input and output file paths
input_file = 'path/to/input/file.csv'
output_file = 'path/to/output/file.csv'
# open the input file with utf-16 encoding
with codecs.open(input_file, 'r', encoding='utf-16') as infile:
    # open the output file with utf-8 encoding
    with codecs.open(output_file, 'w', encoding='utf-8') as outfile:
        # read each row from the input file
        reader = csv.reader(infile)
        # write each row to the output file
        writer = csv.writer(outfile)
        for row in reader:
            writer.writerow(row)
This code opens the input file with utf-16 encoding using codecs.open(), reads each row with csv.reader(), and writes it to the output file, which is opened with utf-8 encoding, using csv.writer(). Instead of utf-8 you can use any desired target encoding.
Missing values
If your CSV file contains missing values, you need to handle them appropriately. By default, the read_csv function treats values such as '', 'NA', 'N/A', 'n/a', 'NaN', '-NaN', 'nan', '-nan', 'null' and 'NULL' as missing. If your CSV file marks missing values with a different format, you need to specify those markers using the na_values parameter.
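For example, a sketch for a file that marks missing values with 'missing' or '?' (hypothetical markers):
import pandas as pd
# treat 'missing' and '?' as NaN in addition to the defaults
df = pd.read_csv('path/to/file.csv', na_values=['missing', '?'])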
Header row issues
By default, the read_csv function assumes that the first row of the CSV file contains the header row. If your CSV file doesn't have a header row, you can specify that using the header parameter (header=None). Alternatively, if your CSV file has a header row but it's not the first row, you can skip the rows above it using the skiprows parameter.
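A short sketch of both cases (file paths are placeholders):
import pandas as pd
# file without a header row: pandas assigns numeric column names 0, 1, 2, ...
df = pd.read_csv('path/to/file.csv', header=None)
# header on the third line: skip the two rows above it
df = pd.read_csv('path/to/file.csv', skiprows=2)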
Type conversion issues
Pandas tries to infer the data types of the columns based on the data in the CSV file. However, sometimes it may not infer the correct data type. You can specify the data type for each column using the dtype parameter.
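For example, a sketch that pins the types of two hypothetical columns so that leading zeros in zip_code are not lost:
import pandas as pd
# force column types instead of relying on inference
df = pd.read_csv('path/to/file.csv', dtype={'id': 'int64', 'zip_code': str})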
Memory issues
If your CSV file is too large to fit into memory, you can use the chunksize parameter to read the file in chunks. This parameter specifies the number of rows to read at a time.
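A minimal sketch that processes a hypothetical large file in chunks of 100,000 rows:
import pandas as pd
# read 100,000 rows at a time instead of loading the whole file
for chunk in pd.read_csv('path/to/large_file.csv', chunksize=100_000):
    # process each chunk separately, e.g. filter or aggregate
    print(chunk.shape)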