Many of you have probably used the humble but powerful Microsoft Excel, and have worked with tables, rows and columns at some point. In Data Science, there is a lot to be done with these rows and columns. You could do much of it in Excel and similar tools, but because they are built around a graphics-heavy interface, they slow down when processing large amounts of data. They are great tools, no doubt, but they fall short when you apply scientific techniques to massive data sets.
Data-specific packages in statistical and scientific computing environments like R and Python are generally far more efficient than popular productivity tools for this kind of work. They are designed to crunch large data sets quickly, and if you know how to work with these packages, they are a fast way to apply scientific techniques and extract knowledge from the data. One such package is pandas (no, it does not start with a capital P). It is a basic but very useful and powerful package that data scientists and analysts working in Python rely on today.
The name pandas is derived from “panel data”, a term used in econometrics for data sets that hold observations over multiple time periods. pandas is one of the first stepping stones to data analysis, or even data science, in Python. Well, not exactly: you also need an idea about NumPy, another package that defines essential data structures for numerical gymnastics and on which pandas is based.
pandas is an open-source library providing high-performance, easy-to-use data structures for Python. It offers two main data structures:
Series is a one-dimensional structure (one row or one column, whichever way you look at it) that holds homogeneous data. This means every element in the data structure is of the same type. If it holds numbers, all the elements should be numbers.
DataFrame is a two-dimensional data structure capable of holding heterogeneous data in rows and columns, that is, in tabular format. Heterogeneous here means that, unlike in a Series, the elements need not all be of the same type; typically each column has its own type. DataFrames are two-dimensional to start with, but with nesting they can be viewed as having more than two dimensions.
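The difference between the two structures can be seen by inspecting their types. The following is a minimal sketch; the names `ages` and `people` and the sample values are made up for illustration.

```python
import pandas as pd

# A Series holds values of a single type (here: integers).
ages = pd.Series([25, 32, 47])
print(ages.dtype)

# A DataFrame can mix types: each column carries its own type.
# The column names and values below are hypothetical sample data.
people = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],   # text
    "age": [25, 32, 47],               # integers
    "height_m": [1.62, 1.80, 1.75],    # floating-point numbers
})
print(people.dtypes)
```

Printing `dtypes` lists one type per column, which is what "heterogeneous" means in practice.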
pandas is a data analysis package, and it is used for exactly that. It also finds use in the initial phases of Machine Learning, where there is a lot of exploratory data analysis, data cleaning and data wrangling. It is widely compatible: you can import data from popular formats like XLS and XLSX (Excel), JSON, CSV (comma-separated values), flat files and more. You can even read data from a database table into pandas data structures.
Many operations are possible at row level, column level, or on an entire DataFrame or Series, such as merging, re-shaping, scalar operations, and operations between DataFrames and Series. With its native ability to crunch numbers, pandas is well suited to financial applications such as financial time series or economic data. It can also be used for stock market analysis.
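As a rough illustration of importing data, the sketch below reads CSV content with `pd.read_csv`. To keep it self-contained, the CSV text is held in a string and wrapped in `io.StringIO`; the contents and the name `sales` are hypothetical, and in practice you would normally pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """region,units,price
North,10,2.5
South,4,3.0
"""

# read_csv accepts a file path or any file-like object.
sales = pd.read_csv(io.StringIO(csv_text))
print(sales.head())
```

pandas provides similar readers for other sources, such as `pd.read_excel`, `pd.read_json` and `pd.read_sql`; some of them need an extra package or a database connection to be available.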
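A few of these operations are sketched below on two small, made-up tables. The ticker symbols, prices and volumes are placeholder values, not real market data.

```python
import pandas as pd

# Hypothetical sample data.
prices = pd.DataFrame({"ticker": ["AAA", "BBB"], "price": [10.0, 20.0]})
volumes = pd.DataFrame({"ticker": ["AAA", "BBB"], "volume": [100, 50]})

# Merging: combine two DataFrames on a shared column.
merged = pd.merge(prices, volumes, on="ticker")

# Scalar operation: the multiplication applies to every element of the column.
merged["price_doubled"] = merged["price"] * 2

# Operation between two columns (each column is a Series).
merged["turnover"] = merged["price"] * merged["volume"]

# Re-shaping: turn the wide table into a long one, one row per (ticker, variable).
long_form = merged.melt(id_vars="ticker", value_vars=["price", "volume"])
print(long_form)
```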
To start using pandas, it helps to get used to NumPy. At its core, this Python library provides a multi-dimensional array object, the ndarray, and various derived objects like masked arrays and matrices, packaged together with a number of array-based routines for manipulating the data in these arrays. Mathematical operations on arrays like transposing, shape manipulation, sorting, discrete Fourier transforms, linear algebra, statistical operations and random number simulations are possible with NumPy.
While you do not have to import NumPy before you start using pandas in Python, it is a good base for understanding pandas better.
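Some of the array operations mentioned above can be sketched as follows. The array values are arbitrary, and the variable names are chosen only for this example.

```python
import numpy as np

a = np.array([[3, 1, 2], [6, 5, 4]])  # a 2x3 ndarray

t = a.T                           # transpose: shape becomes (3, 2)
flat = a.reshape(6)               # shape manipulation: 2x3 -> 6 elements
sorted_rows = np.sort(a, axis=1)  # sort within each row
col_means = a.mean(axis=0)        # statistics: mean of each column
gram = a @ a.T                    # linear algebra: (2x3) times (3x2) -> 2x2

# Random numbers: five samples from the interval [0, 1).
rng = np.random.default_rng(seed=0)
samples = rng.random(5)

print(t.shape, flat.shape, gram.shape, samples.shape)
```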
Let’s look at how to start using pandas in Python. We shall not step into coding details here.
Python, like many other multipurpose languages, depends on libraries built by community contributors. Python has its basic coding framework, which supports declaring variables of supported data types, conditional structures, looping structures, operators, expressions and statements. If you are not familiar with programming, don’t worry; these are just programming constructs. Here is a simple, very high-level flowchart that generally applies to the use of pandas in Python.
Some code may also help put things in perspective. Here it is.
In a Python script, assuming that you have pandas installed, use the statement below.
import pandas as pd
pd is the alias that refers to your pandas library. You can now use pd to access the data structures and methods that pandas offers.
You then create a data structure, let’s say a DataFrame, using the statement below.
df=pd.DataFrame(). You just created a DataFrame and gave it the name df. It is empty right now. You could also have used some existing data in your script to initialize the DataFrame. Assuming that you have a Python dictionary with data in it, here is how you can use the dictionary to create a pandas DataFrame and populate it with values.
df=pd.DataFrame(my_dict), where my_dict is the Python dictionary with already existing values. There are many other ways to get the same result, so you may want to get creative and look for other ways to do the same thing.
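Put together, the dictionary route might look like the sketch below. The contents of `my_dict` are hypothetical sample values; the dictionary keys become column names and each list becomes a column.

```python
import pandas as pd

# Hypothetical dictionary: keys become column names, lists become column values.
my_dict = {
    "product": ["pen", "ink", "pad"],
    "units": [10, 3, 7],
}

df = pd.DataFrame(my_dict)
print(df.shape)  # (number of rows, number of columns)
print(df)
```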
Now the same data is residing in a DataFrame named df, and you just got more power to do all sorts of operations on this data with the methods built into pandas DataFrame.
For example, you could use the transpose method for pandas DataFrames.
pandas Series is a one-dimensional, labelled array: a homogeneous data structure that can hold any data type or object type recognized by your Python script. So it can be a Series of integers, a Series of strings, or even a Series of Series or a Series of DataFrames.
Let’s go through some simple code structures to understand some basics of Series.
Assume that you have already imported pandas and assigned it a name, in our case pd. The statement below creates a Series data structure and assigns it a name.
srs=pd.Series(). The new Series is empty as of now.
You could use the random package of NumPy to generate a single-dimensional array of random numbers like this.
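A minimal sketch of such a call, assuming NumPy is imported as `np` and using `np.random.rand`, which is one of several ways to generate random values:

```python
import numpy as np

# One-dimensional array of 10 random floats drawn from the interval [0, 1).
random_array = np.random.rand(10)
print(random_array.shape)  # (10,)
```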
Now random_array refers to a one-dimensional array containing 10 randomly generated values.
We can now use this array to initialize our pandas Series like so.
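One way this could look, as a sketch that regenerates the array so it runs on its own (the names `random_array` and `srs` follow the text above):

```python
import numpy as np
import pandas as pd

# Array of 10 random floats, as described above.
random_array = np.random.rand(10)

# Passing the array to the constructor initializes the Series with its values.
# pandas adds default integer labels 0..9 as the index.
srs = pd.Series(random_array)
print(srs.head())
```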
This was just a simple example; in practice, you could also import data from a file into the Series. If you are just trying out pandas to learn it, you could also enter the values individually. You could even use Python lists to create and initialize the Series.
DataFrame is a 2-dimensional data structure capable of holding heterogeneous data, similar to an Excel spreadsheet or database table. The datatypes that can be used to create the DataFrames include any data type recognized in Python. The values in the DataFrame could also be of type object.
Just like Excel, a pandas DataFrame allows data to be sliced and diced, pivoted, split, formatted, cleaned, transformed and much more. The difference is that it is much quicker in a DataFrame if you know the right commands and techniques. Once you learn it, it is a breeze; besides, pandas data structures are efficient at handling data, even massive data.
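To give a flavour of slicing, pivoting and splitting, here is a small sketch on a made-up `orders` table. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
orders = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["pen", "ink", "pen", "ink"],
    "units": [10, 3, 7, 5],
})

# Slice: keep only the rows that match a condition.
north = orders[orders["region"] == "North"]

# Pivot: regions become rows, products become columns.
pivoted = orders.pivot(index="region", columns="product", values="units")

# Split and aggregate: total units per region.
totals = orders.groupby("region")["units"].sum()

print(pivoted)
print(totals)
```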
Let’s quickly go through some simple code for better understanding.
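A minimal sketch, assuming a hypothetical list of rows named `my_list`:

```python
import pandas as pd

# Hypothetical list of rows; each inner list becomes one row of the DataFrame.
my_list = [["pen", 10], ["ink", 3], ["pad", 7]]

# Without column names, pandas labels the columns 0, 1, ...
df = pd.DataFrame(my_list)

# Optionally, name the columns explicitly.
named = pd.DataFrame(my_list, columns=["product", "units"])
print(named)
```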
df=pd.DataFrame(my_list) will give you a pandas DataFrame containing the values in my_list, a list object in Python.
You could also import data into the DataFrame from external sources like a flat file or an Excel spreadsheet. It could also be loaded from a database table.
A simple operation on the DataFrame would be a transpose like so.
dft=df.transpose() or dft=df.T; either will work.
pandas is easy to pick up. If you think of pandas data structures as a command-line way of dealing with a spreadsheet, you might grasp it quicker.
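As a small sketch with made-up values, transposing swaps rows and columns, so the column labels become the row labels:

```python
import pandas as pd

# Hypothetical 3x2 DataFrame: three rows, columns "a" and "b".
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

dft = df.transpose()   # same result as df.T
print(dft.shape)       # (2, 3)
print(dft)
```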