Reading a Column Pandas From Read Csv

A large share of data is available in tabular form as CSV files, and the format is very popular. You can load such a file into a pandas DataFrame using the pandas.read_csv function.

In this article, you will learn the different features of the pandas read_csv function beyond simply loading a CSV file, and the parameters that can be customized to get better output from it.

pandas.read_csv

  • Syntax: pandas.read_csv(filepath_or_buffer, sep, header, names, index_col, usecols, prefix, dtype, converters, skiprows, skipfooter, nrows, na_values, parse_dates)
  • Purpose: Read a comma-separated values (csv) file into DataFrame. Also supports optionally iterating or breaking the file into chunks.
  • Parameters:
    • filepath_or_buffer : str, path object or file-like object. Any valid string path is acceptable. The string could be a URL too. Path object refers to os.PathLike. File-like objects are objects with a read() method, such as a file handle (e.g. via the built-in open function) or StringIO.
    • sep : str, (Default ',') Separating boundary which distinguishes between any two subsequent data items.
    • header : int, list of int, (Default 'infer') Row number(s) to use as the column names, and the start of the data. The default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file.
    • names : array-like. List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
    • index_col : int, str, sequence of int/str, or False, (Default None) Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int/str is given, a MultiIndex is used.
    • usecols : list-like or callable. Return a subset of the columns. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.
    • prefix : str. Prefix to add to column numbers when there is no header, e.g. 'X' for X0, X1. (Deprecated in pandas 1.4 and removed in pandas 2.0.)
    • dtype : Type name or dict of column -> type. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype.
    • converters : dict. Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
    • skiprows : list-like, int or callable. Line numbers to skip (0-indexed) or the number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise.
    • skipfooter : int. Number of lines at the bottom of the file to skip.
    • nrows : int. Number of rows of the file to read. Useful for reading pieces of large files.
    • na_values : scalar, str, list-like, or dict. Additional strings to recognize as NA/NaN. If a dict is passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
    • parse_dates : bool or list of int or names or list of lists or dict, (default False) If set to True, will try to parse the index, else parse the columns passed.
  • Returns: DataFrame or TextParser. A comma-separated values (CSV) file is returned as a two-dimensional data structure with labeled axes.

For the full list of parameters, refer to the official documentation.

Reading CSV file

The pandas read_csv function can be used in different ways as needed, such as using custom separators, reading only selected columns or rows, and so on. All cases are covered below one after another.

Default Separator

To read a CSV file, call the pandas function read_csv() and pass the file path as input.

Step 1: Import Pandas

    import pandas as pd

Step 2: Read the CSV

    # Read the csv file
    df = pd.read_csv("data1.csv")

    # First 5 rows
    df.head()

Different, Custom Separators

By default, a CSV is separated by commas, but you can use other separators as well. The pandas.read_csv function is not limited to reading CSV files with the default separator (i.e. comma). It can be used with other separators such as ;, | or :. To load CSV files with such separators, pass the separator used in the file through the sep parameter.

Let's load a file with the | separator:

    # Read the csv file with sep='|'
    df = pd.read_csv("data2.csv", sep='|')
    df
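A minimal sketch of the same idea that runs without any file on disk. The pipe-separated text and its column names are made up for illustration and stand in for data2.csv:

```python
import io

import pandas as pd

# Hypothetical pipe-separated content standing in for data2.csv
raw = io.StringIO("Rank|Country|Population\n1|China|1400\n2|India|1380\n")

# sep tells the parser which character separates the fields
df = pd.read_csv(raw, sep="|")

print(df.columns.tolist())
print(df.shape)
```

Without sep="|", each line would be read as a single field, because no comma appears in the text.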

Set any row as column header

Let's see the data frame created by the pandas read_csv function without any header parameter:

    # Read the csv file
    df = pd.read_csv("data1.csv")
    df.head()

Row 0 of the data frame seems to be a better fit for the header, since it describes the figures in the table better. You can make this row the header while reading the CSV by using the header parameter, which takes a row number as its value.

Note: Row numbering starts from 0 and includes the current column header, so row 0 of the data frame is row 1 of the file.

    # Read the csv file with header parameter
    df = pd.read_csv("data1.csv", header=1)
    df.head()
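The effect of header=1 can be sketched on a small in-memory file. The content below is hypothetical: its first line holds placeholder labels and its second line holds the names that should really be the header.

```python
import io

import pandas as pd

# Hypothetical file: line 0 holds placeholder labels, line 1 the real names
raw_text = (
    "col_a,col_b,col_c\n"
    "Rank,Country,Population\n"
    "1,China,1400\n"
    "2,India,1380\n"
)

# Default: line 0 becomes the header, so the real names land in row 0
default_df = pd.read_csv(io.StringIO(raw_text))

# header=1: line 1 becomes the header and everything above it is dropped
promoted_df = pd.read_csv(io.StringIO(raw_text), header=1)

print(default_df.iloc[0].tolist())
print(promoted_df.columns.tolist())
```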

Renaming column headers

While reading the CSV file, you can rename the column headers by using the names parameter. The names parameter takes a list of column header names.

    # Read the csv file with names parameter
    df = pd.read_csv("data.csv", names=['Ranking', 'ST Name', 'Pop', 'NS', 'D'])
    df.head()

To avoid the old header being read as a data row of the data frame, also provide the header parameter, which overrides the old header names with the new names.

    # Read the csv file with header and names parameter
    df = pd.read_csv("data.csv", header=0, names=['Ranking', 'ST Name', 'Pop', 'NS', 'D'])
    df.head()
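The difference between passing names alone and passing names together with header=0 can be checked on a small in-memory sample. The data and the new names below are hypothetical:

```python
import io

import pandas as pd

# Hypothetical file that already has a header line
raw_text = "Rank,Country,Population\n1,China,1400\n2,India,1380\n"
new_names = ["Ranking", "Nation", "Pop"]

# names only: the old header line is kept as the first data row
without_header = pd.read_csv(io.StringIO(raw_text), names=new_names)

# names with header=0: the old header line is replaced by the new names
with_header = pd.read_csv(io.StringIO(raw_text), header=0, names=new_names)

print(len(without_header), len(with_header))
```

A side effect of the first form is that every column is read as text, because the old header strings are mixed in with the numeric values.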

Loading CSV without column headers in pandas

The CSV file you load may not have any column header. By default, pandas will use the first row as the column header.

    # Read the csv file
    df = pd.read_csv("data3.csv")
    df.head()

To avoid any row being inferred as the column header, specify header as None. This forces pandas to create numbered columns starting from 0.

    # Read the csv file with header=None
    df = pd.read_csv("data3.csv", header=None)
    df.head()
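Both behaviors can be compared side by side on a headerless sample. The three rows below are made up and stand in for data3.csv:

```python
import io

import pandas as pd

# Hypothetical headerless content standing in for data3.csv
raw_text = "1,China,1400\n2,India,1380\n3,USA,330\n"

# Default: the first data row is consumed as the header
inferred = pd.read_csv(io.StringIO(raw_text))

# header=None: all rows stay as data and columns are numbered from 0
numbered = pd.read_csv(io.StringIO(raw_text), header=None)

print(inferred.columns.tolist(), len(inferred))
print(numbered.columns.tolist(), len(numbered))
```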

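Adding prefixes to numbered columns

You can also give prefixes to the numbered column headers using the prefix parameter of the pandas read_csv function. Note that prefix was deprecated in pandas 1.4 and removed in pandas 2.0, so the call below works only on older versions.

    # Read the csv file with header=None and prefix=column_
    # (requires pandas < 2.0, where the prefix parameter still exists)
    df = pd.read_csv("data3.csv", header=None, prefix='column_')
    df.head()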
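On pandas versions where prefix is no longer available, the same column names can be produced after loading. This is a sketch of one possible workaround, not the article's original method; the sample rows are hypothetical:

```python
import io

import pandas as pd

# Hypothetical headerless content standing in for data3.csv
raw_text = "1,China,1400\n2,India,1380\n"

# Load with numbered columns, then rename them with a list comprehension
df = pd.read_csv(io.StringIO(raw_text), header=None)
df.columns = [f"column_{i}" for i in df.columns]

print(df.columns.tolist())
```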

Set any column(s) as index

By default, pandas adds an initial index to the data frame loaded from the CSV file. You can control this behavior and make any column of your CSV the index by using the index_col parameter.

It takes the name of the column that should become the index.

Example 1: Making one column the index

    # Read the csv file with 'Rank' as index
    df = pd.read_csv("data.csv", index_col='Rank')
    df.head()

Example 2: Making multiple columns the index

To make two or more columns the index, pass them as a list.

    # Read the csv file with 'Rank' and 'Date' as index
    df = pd.read_csv("data.csv", index_col=['Rank', 'Date'])
    df.head()
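A small runnable sketch of both cases, using made-up rows with the Rank and Date column names from the examples above:

```python
import io

import pandas as pd

# Hypothetical sample with the column names used in the article
raw_text = "Rank,Country,Date\n1,China,2021-02-25\n2,India,2021-04-14\n"

# One column as the index
single = pd.read_csv(io.StringIO(raw_text), index_col="Rank")

# Several columns as the index produce a MultiIndex
multi = pd.read_csv(io.StringIO(raw_text), index_col=["Rank", "Date"])

print(single.index.name)
print(list(multi.index.names))
```

Columns used as the index no longer appear among the regular columns of the data frame.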

Selecting columns while reading CSV

In practice, not all the columns of a CSV file are important. You can select the necessary columns after loading the file, but if you know them beforehand, selecting them at load time saves space and time.

The usecols parameter takes the list of columns you want to load into your data frame.

Selecting columns using list

    # Read the csv file with 'Rank', 'Date' and 'Population' columns (list)
    df = pd.read_csv("data.csv", usecols=['Rank', 'Date', 'Population'])
    df.head()

Selecting columns using callable functions

The usecols parameter can also take a callable function. The function is evaluated on each column name, and a column is selected when the function evaluates to True.

    # Read the csv file with columns where length of column name > 10
    df = pd.read_csv("data.csv", usecols=lambda x: len(x) > 10)
    df.head()
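Both forms of usecols can be tried on a one-row sample. The row values are invented; the column names follow the ones that appear elsewhere in the article:

```python
import io

import pandas as pd

# Hypothetical sample row using the article's column names
raw_text = (
    "Rank,Country,Population,National Share (%),Date\n"
    "1,China,1400,17.9,2021-02-25\n"
)

# List form: only the named columns are loaded
by_list = pd.read_csv(io.StringIO(raw_text), usecols=["Rank", "Date", "Population"])

# Callable form: keep columns whose name is longer than 10 characters
by_callable = pd.read_csv(io.StringIO(raw_text), usecols=lambda name: len(name) > 10)

print(by_list.columns.tolist())
print(by_callable.columns.tolist())
```

The selected columns come back in file order, not in the order given in the list, and 'Population' is excluded by the callable because its name is exactly 10 characters long.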

Selecting/skipping rows while reading CSV

You can skip or select a specific number of rows from the dataset using the pandas.read_csv function. Three parameters can do this job: nrows, skiprows and skipfooter.

Each of them works differently. Let's discuss them separately.

A. nrows : This parameter lets you control how many rows to load from the CSV file. It takes an integer specifying the row count.

    # Read the csv file with 5 rows
    df = pd.read_csv("data.csv", nrows=5)
    df

B. skiprows : This parameter allows you to skip rows from the beginning of the file.

Skiprows by specifying row indices

    # Read the csv file with first row skipped
    df = pd.read_csv("data.csv", skiprows=1)
    df.head()

Skiprows by using a callable function

The skiprows parameter can also take a callable function as input, which is evaluated on the row indices. The function is checked for every row index to determine whether that row should be skipped.

    # Read the csv file with odd rows skipped
    df = pd.read_csv("data.csv", skiprows=lambda x: x % 2 != 0)
    df.head()
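The three forms of skiprows behave differently, which is easy to miss. The sketch below uses a hypothetical four-row sample to contrast an integer (a count of lines from the top, header line included), a list (specific 0-indexed line numbers), and a callable:

```python
import io

import pandas as pd

# Hypothetical sample: line 0 is the header, lines 1-4 are data
raw_text = "Rank,Country\n1,China\n2,India\n3,USA\n4,Indonesia\n"

# Integer: skip that many lines from the top, so the header line is lost
skip_count = pd.read_csv(io.StringIO(raw_text), skiprows=1)

# List: skip exactly those line numbers, so the header line survives
skip_index = pd.read_csv(io.StringIO(raw_text), skiprows=[1])

# Callable: skip every line whose index is odd (lines 1 and 3)
skip_odd = pd.read_csv(io.StringIO(raw_text), skiprows=lambda i: i % 2 != 0)

print(skip_count.columns.tolist())
print(skip_index["Rank"].tolist())
print(skip_odd["Rank"].tolist())
```

With skiprows=1 the first data line is promoted to the header, so skiprows=[1] is usually the safer choice when the goal is to drop the first data row.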

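C. skipfooter : This parameter allows you to skip rows from the end of the file.

    # Read the csv file with 1 row skipped from the end
    # (skipfooter is not supported by the default C engine, so the Python engine is requested)
    df = pd.read_csv("data.csv", skipfooter=1, engine="python")
    df.tail()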
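A short sketch of nrows and skipfooter on an in-memory sample. The data and the trailing summary line are invented for illustration:

```python
import io

import pandas as pd

# Hypothetical sample whose last line is a footer, not data
raw_text = "Rank,Country\n1,China\n2,India\n3,USA\nTotal rows: 3\n"

# nrows: load only the first two data rows
head_only = pd.read_csv(io.StringIO(raw_text), nrows=2)

# skipfooter: drop the last line; the Python engine is requested because
# the default C engine does not support skipfooter
no_footer = pd.read_csv(io.StringIO(raw_text), skipfooter=1, engine="python")

print(head_only["Rank"].tolist())
print(no_footer["Rank"].tolist())
```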

Changing the data type of columns

You can specify the data types of columns while reading the CSV file. The dtype parameter takes a dictionary that maps column names to their data types. To assign the data types, you can import them from the numpy package and list them against the appropriate columns.

Data Type of Rank before change

    # Read the csv file
    df = pd.read_csv("data.csv")

    # Display datatype of Rank
    df.Rank.dtypes

    dtype('int64')

Data Type of Rank after change

    # import numpy
    import numpy as np

    # Read the csv file with data type specified for Rank.
    df = pd.read_csv("data.csv", dtype={'Rank': np.int8})

    # Display datatype of Rank
    df.Rank.dtypes

    dtype('int8')
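The before and after comparison can be reproduced without a file. The two rows below are hypothetical:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample with a small integer column
raw_text = "Rank,Country\n1,China\n2,India\n"

# Default: integers are read as int64
default_df = pd.read_csv(io.StringIO(raw_text))

# dtype: request a narrower integer type for the Rank column only
small_df = pd.read_csv(io.StringIO(raw_text), dtype={"Rank": np.int8})

print(default_df["Rank"].dtype, small_df["Rank"].dtype)
```

A narrower type such as int8 only holds values from -128 to 127, so it is worth checking the range of the column before choosing it.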

Parse Dates while reading CSV

Datetime values are crucial for data analysis. You can convert a column to a datetime type column while reading the CSV in two ways:

Method 1. Make the desired column the index and pass parse_dates=True

    # Read the csv file with 'Date' as index and parse_dates=True
    df = pd.read_csv("data.csv", index_col='Date', parse_dates=True, nrows=5)

    # Display index
    df.index

    DatetimeIndex(['2021-02-25', '2021-04-14', '2021-02-19', '2021-02-24',
                   '2021-02-13'],
                  dtype='datetime64[ns]', name='Date', freq=None)

Method 2. Pass the desired column in parse_dates as a list

    # Read the csv file with parse_dates=['Date']
    df = pd.read_csv("data.csv", parse_dates=['Date'], nrows=5)

    # Display datatypes of columns
    df.dtypes

    Rank                           int64
    Country                       object
    Population                    object
    National Share (%)            object
    Date                  datetime64[ns]
    dtype: object
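Both methods can be sketched on a two-row sample. The rows are hypothetical; the checks use pandas type helpers rather than a fixed dtype string, since the exact datetime resolution may vary between pandas versions:

```python
import io

import pandas as pd

# Hypothetical sample with ISO-formatted dates
raw_text = "Rank,Date\n1,2021-02-25\n2,2021-04-14\n"

# Method 1: the date column becomes the index and is parsed
by_index = pd.read_csv(io.StringIO(raw_text), index_col="Date", parse_dates=True)

# Method 2: the date column stays a regular column and is parsed
by_column = pd.read_csv(io.StringIO(raw_text), parse_dates=["Date"])

print(type(by_index.index).__name__)
print(pd.api.types.is_datetime64_any_dtype(by_column["Date"]))
```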

Adding more NaN values

The pandas library can handle many forms of missing values. But there are many cases where the data contains missing values in forms that are not present in the pandas NA values list. It doesn't understand 'missing', 'not found', or 'not available' as missing values.

So, you need to mark them as missing. To do this, use the na_values parameter, which takes a list of such values.

Loading CSV without specifying na_values

    # Read the csv file
    df = pd.read_csv("data.csv", nrows=5)
    df

Loading CSV with na_values specified

    # Read the csv file with 'missing' as na_values
    df = pd.read_csv("data.csv", na_values=['missing'], nrows=5)
    df
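A runnable sketch of the list form and the per-column dict form of na_values. The sample rows and their placeholder strings are made up:

```python
import io

import pandas as pd

# Hypothetical sample with custom placeholders for missing data
raw_text = "Rank,Country,Population\n1,China,missing\n2,not found,1380\n"

# Default: the placeholders are kept as ordinary strings
default_df = pd.read_csv(io.StringIO(raw_text))

# List form: the strings count as missing in every column
custom_df = pd.read_csv(io.StringIO(raw_text), na_values=["missing", "not found"])

# Dict form: the strings count as missing only in the named column
per_column = pd.read_csv(io.StringIO(raw_text), na_values={"Population": ["missing"]})

print(default_df.isna().sum().sum())
print(custom_df.isna().sum().tolist())
print(per_column.isna().sum().tolist())
```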

Convert values of the column while reading CSV

You can transform, modify, or convert the values of the columns of the CSV file while loading the CSV itself. This can be done by using the converters parameter. converters takes a dictionary whose keys are the column names and whose values are the functions to be applied to them.

Let's convert the comma-separated values (e.g. 19,98,12,341) of the Population column in the dataset to integer values (199812341) while reading the CSV.

    # Function which converts a comma-separated value to an integer
    # ('missing' is mapped to -1)
    toInt = lambda x: int(x.replace(',', '')) if x != 'missing' else -1

    # Read the csv file
    df = pd.read_csv("data.csv", converters={'Population': toInt})
    df.head()
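The converter can be tried on a small in-memory sample. The rows are hypothetical, and to_int is an illustrative named version of the lambda above:

```python
import io

import pandas as pd

# Hypothetical sample; the quoted field keeps its inner commas
raw_text = 'Rank,Population\n1,"19,98,12,341"\n2,missing\n'


def to_int(value: str) -> int:
    """Strip thousands separators; map the 'missing' placeholder to -1."""
    if value == "missing":
        return -1
    return int(value.replace(",", ""))


# The converter receives each raw string of the Population column
df = pd.read_csv(io.StringIO(raw_text), converters={"Population": to_int})

print(df["Population"].tolist())
```

Using -1 as a sentinel follows the article's example; whether a sentinel or a real missing value is the better choice depends on how the column is used afterwards.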

Practical Tips

  • Before loading the CSV file into a pandas data frame, always take a quick look at the file. It will help you judge which columns you should import and determine what data types your columns should have.
  • You should also watch out for the total row count of the dataset. A system with 4 GB RAM may not be able to load 7-8M rows.
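Both tips can be sketched with parameters of read_csv itself: nrows for a quick preview, and chunksize for processing a file piece by piece instead of loading it whole. The ten-row sample is generated for illustration, and the chunk size of 4 is arbitrary:

```python
import io

import pandas as pd

# Hypothetical ten-row sample standing in for a large file
raw_text = "Rank,Country\n" + "".join(f"{i},Country{i}\n" for i in range(1, 11))

# Tip 1: preview a few rows to inspect columns and types
preview = pd.read_csv(io.StringIO(raw_text), nrows=3)
print(preview.dtypes)

# Tip 2: iterate over the file in chunks to keep memory use bounded
total_rows = 0
for chunk in pd.read_csv(io.StringIO(raw_text), chunksize=4):
    total_rows += len(chunk)

print(total_rows)
```

Each chunk is an ordinary DataFrame, so filtering or aggregation can be applied per chunk and the partial results combined at the end.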

Test your knowledge

Q1: You cannot load files with the $ separator using the pandas read_csv function. True or False?

Answer: False. You can pass the separator through the sep parameter of the read_csv function.

Q2: What is the use of the converters parameter in the read_csv function?

Answer: The converters parameter is used to modify the values of the columns while loading the CSV.

Q3: How will you make pandas recognize that a particular column is of datetime type?

Answer: By using the parse_dates parameter.

Q4: A dataset contains the missing values no, not available, and '-100'. How will you specify them as missing values so that pandas interprets them correctly? (Assume CSV file name: example1.csv)

Answer: By using the na_values parameter.

    import pandas as pd

    df = pd.read_csv("example1.csv", na_values=['no', 'not available', '-100'])

Q5: How would you read a CSV file where:

  1. The heading of the columns is in the third row (numbered from 1).
  2. The last 5 lines of the file contain garbage text and should be skipped.
  3. Only the columns whose names start with a vowel should be included. Assume the names are one word only.

(CSV file name: example2.csv)

Answer:

    import pandas as pd

    # True when the column name starts with a vowel
    colnameWithVowels = lambda x: x.lower()[0] in ['a', 'e', 'i', 'o', 'u']

    # header=2 is the third row; skipfooter needs the Python engine
    df = pd.read_csv("example2.csv", usecols=colnameWithVowels, header=2, skipfooter=5, engine="python")

The article was contributed by Kaustubh G and Shrivarsheni

koehlereigerstand.blogspot.com

Source: https://www.machinelearningplus.com/pandas/pandas-read_csv-completed/
