Why Python is the Most Preferred Language for Data Science

Profile: https://thomascherickal.com Portfolio: https://thomascherickal.github.io Blog: https://hackernoon.com/u/thomascherickal Presence: https://linktr.ee/thomascherickal LinkedIn: https://linkedin.com/in/thomascherickal GitHub: https://github.com/thomascherickal Email: thomascherickal@gmail.com

There Are Innumerable Reasons

The Python Programming Language

Python is a high-level, interpreted programming language known for its simplicity, readability, and ease of use. It was created by Guido van Rossum and released in 1991. Python is an object-oriented language, which means it uses objects to represent data and functionality.

Python is dynamically typed, meaning you don't need to declare data types when defining variables, and it features automatic memory management, which frees programmers from the need to manually manage memory allocation and deallocation.
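As a rough illustration of both points, the sketch below rebinds one name to values of different types and then shows an object being reclaimed once nothing refers to it. The `Sample` class and the variable names are made up for this example.

```python
import gc
import weakref


class Sample:
    """A trivial class, used only for this illustration."""


# Dynamic typing: the same name can be rebound to values of different types.
value = 42
first_type = type(value).__name__      # "int"
value = "forty-two"
second_type = type(value).__name__     # "str"

# Automatic memory management: once the last reference to an object goes
# away, the interpreter reclaims it without any explicit "free" call.
obj = Sample()
watcher = weakref.ref(obj)   # a weak reference does not keep obj alive
del obj                      # drop the only strong reference
gc.collect()                 # forces a collection on interpreters that defer it
reclaimed = watcher() is None

print(first_type, second_type, reclaimed)
```

No type declarations and no manual deallocation appear anywhere in the snippet.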

Python's syntax is designed to be readable and intuitive, with an emphasis on using natural language constructs to write code. It uses whitespace to structure code blocks, rather than curly braces or keywords like "begin" and "end."
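A minimal sketch of what indentation-based structure looks like in practice; the function name `classify` is hypothetical and chosen only for this example.

```python
def classify(n):
    # The indented lines below form the body of the function.
    if n % 2 == 0:
        # This block runs only when the condition is true.
        return "even"
    else:
        return "odd"


labels = [classify(n) for n in range(4)]
print(labels)
```

The indentation is not decoration: removing it changes the meaning of the program or makes it invalid.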

The ease of use and simplicity became big selling points after Python 2 was released. Python went through a number of iterations on its journey to version 3.11, which, at the time of writing, is the fastest and most efficient release so far from Python.org.

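There is also an alternative implementation of Python called PyPy, whose just-in-time (JIT) compiler can make pure-Python code run considerably faster. On the standard interpreter, CPython, libraries such as Numba and NumPy bring High-Performance Computing (HPC) capabilities to numerical code.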

The story of the evolution of Python from Python 1.0 to 3.11 is given below, in a summarised form.

The Evolution of Python

The evolution of the Python programming language can be divided into several major versions:

  • Python 1.x: Python was first released publicly in 1991 (as version 0.9.0), and version 1.0 followed in 1994. It was a simple language with basic data types and structures, but it quickly gained popularity due to its ease of use and readability.
  • Python 2.x: This version of Python was released in 2000 and introduced several new features, including list comprehensions, iterators, and generators. It also included significant improvements to the standard library.
  • Python 3.x: This version was released in 2008 and introduced several major changes to the language, including a new way of handling Unicode strings, improvements to the standard library, and syntax changes that made the language more consistent and easier to use. However, it was not fully backward compatible with previous versions of Python, which caused some compatibility issues.
  • Python 3.2 to 3.11 (current version at the time of writing): These versions introduced various new features and improvements, such as better Unicode handling, improved syntax for exception handling, and better performance, especially in Python 3.11, which introduced a specializing adaptive interpreter. (An experimental just-in-time (JIT) compiler arrived only later, in Python 3.13.)
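To make a few of the features named in this list concrete, here is a small sketch of a list comprehension, a generator, and Python 3's separation of Unicode text from bytes. The function name `countdown` and the sample string are made up for illustration.

```python
# List comprehension: build a list in a single expression.
squares = [n * n for n in range(5)]


# Generator: values are produced lazily, one at a time.
def countdown(start):
    while start > 0:
        yield start
        start -= 1


ticks = list(countdown(3))

# Python 3 strings are Unicode text; bytes are a separate type.
text = "na\u00efve caf\u00e9"          # "naïve café", written with escapes
encoded = text.encode("utf-8")        # text -> bytes
round_trip = encoded.decode("utf-8")  # bytes -> text

print(squares, ticks, len(text), len(encoded), round_trip == text)
```

The string has 10 characters but encodes to 12 bytes, because each of the two accented letters takes two bytes in UTF-8.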

The backward compatibility issues could be partly addressed with a tool called 2to3. Run with the -w option on a directory, it rewrites the Python 2 source files in that directory and its subdirectories into Python 3 syntax. It handles the mechanical changes; differences in behaviour, such as the split between text and bytes, still need manual attention.
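For a feel of the kind of differences involved, the sketch below shows three well-known Python 3 behaviours next to comments describing their Python 2 counterparts. The values are arbitrary. The `print` and `dict.keys()` changes are mechanical and the sort of thing an automatic converter can rewrite; the division change is a difference in meaning and generally has to be reviewed by hand.

```python
# Python 2 would have written:  print "total:", 7 / 2
# and 7 / 2 would have evaluated to 3 (integer division).
true_division = 7 / 2      # 3.5 in Python 3
floor_division = 7 // 2    # 3, using the explicit floor-division operator

# dict.keys() returns a view object in Python 3, not a list,
# so code that needs a list has to ask for one.
keys = list({"a": 1, "b": 2}.keys())

# print is an ordinary function in Python 3.
print("total:", true_division, floor_division, keys)
```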

The adaptive interpreter introduced in Python 3.11 is a large part of why that release is, on average, about 25% faster than Python 3.10 on the standard benchmark suite, with speed-ups of 10% to 60% depending on the workload, which is a fantastic achievement in itself. Also, modern libraries can use the GPUs (Graphics Processing Units) present in modern computing systems for additional speed-up, especially for machine learning and data science.
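If you want to see how much a given interpreter version helps your own code, one simple approach is to time the same workload under each version with the standard `timeit` module. The sketch below is only that, a sketch: the workload `sum_of_squares` is an arbitrary stand-in, and the numbers it prints will differ from machine to machine.

```python
import sys
import timeit


def sum_of_squares(n):
    """A small CPU-bound workload, chosen arbitrarily for this sketch."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Run the workload 5 times per measurement and take 3 measurements.
timings = timeit.repeat(lambda: sum_of_squares(10_000), number=5, repeat=3)
best = min(timings)  # the fastest measurement is the least noisy one

version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python {version}: best of 3 = {best:.6f} s")
```

Running the same file under two interpreter versions and comparing the printed times gives a rough, workload-specific picture of the speed-up.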

Python has been a highly popular language ever since its release and its large and active community of developers has contributed to the creation of a wide range of libraries and frameworks, which brings us to the next feature of Python – its unrivalled large codebase of community-created libraries. But first let’s answer a simpler question. Why is Python so popular?

The Popularity of Python

Example 1 – Hello World

Consider a ‘Hello World!’ program in Java, the most commonly taught first programming language in the past. It reads like this:

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}

HelloWorld.java

Which prints ‘Hello, World!’ on the screen.

Now let’s do it in Python:

print("Hello, World!")

hello.py

Which does exactly the same thing!

Example 2 – Sorting Random Numbers with Quicksort

Let's generate a list of 10 random numbers and sort them using a famous sorting technique called quicksort.

In Java 17 LTS:

import java.util.Arrays;
import java.util.Random;

public class Quicksort {

    public static void main(String[] args) {
        Random rand = new Random();
        int[] arr = new int[10];
        for (int i = 0; i < arr.length; i++) {
            // generate a random number between 0 and 99
            arr[i] = rand.nextInt(100);
        }
        System.out.println("Before sorting: " + Arrays.toString(arr));
        quicksort(arr, 0, arr.length - 1);
        System.out.println("After sorting: " + Arrays.toString(arr));
    }

    public static void quicksort(int[] arr, int left, int right) {
        if (left < right) {
            int pivotIndex = partition(arr, left, right);
            quicksort(arr, left, pivotIndex - 1);
            quicksort(arr, pivotIndex + 1, right);
        }
    }

    public static int partition(int[] arr, int left, int right) {
        int pivot = arr[right];
        int i = left - 1;
        for (int j = left; j < right; j++) {
            if (arr[j] < pivot) {
                i++;
                int temp = arr[i];
                arr[i] = arr[j];
                arr[j] = temp;
            }
        }
        int temp = arr[i + 1];
        arr[i + 1] = arr[right];
        arr[right] = temp;
        return i + 1;
    }
}

Quicksort.java

In Python 3.11:

import random

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        left = [x for x in arr[1:] if x < pivot]
        right = [x for x in arr[1:] if x >= pivot]
        return quicksort(left) + [pivot] + quicksort(right)

arr = [random.randint(0, 99) for _ in range(10)]
print("Before sorting: ", arr)
arr = quicksort(arr)
print("After sorting: ", arr)

quicksort.py
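In everyday Python you would rarely write a sort by hand, since the built-in `sorted()` function already does the job. As a quick sanity check, the sketch below repeats the quicksort from the listing above and compares its result with `sorted()`. The fixed random seed is an addition for repeatability, not part of the original program.

```python
import random


def quicksort(arr):
    """Same approach as the listing above: pivot on the first element."""
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = [x for x in arr[1:] if x < pivot]
    right = [x for x in arr[1:] if x >= pivot]
    return quicksort(left) + [pivot] + quicksort(right)


random.seed(0)  # fixed seed so the run is repeatable
arr = [random.randint(0, 99) for _ in range(10)]

# The hand-written sort and the built-in one should agree.
matches_builtin = quicksort(arr) == sorted(arr)
print(matches_builtin)
```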

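Compare the programs. Now do you see why Python is called simple and Java is often called ‘obfuscated’ (a word that means unnecessarily complicated)? To be fair, the Java version sorts the array in place while the Python version builds new lists, but the difference in readability stands.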

Python reads like English!

Java is, definitely, code.

One more example (absolutely stunning!):

Example 3 – Machine Learning

First, let's use Java 17 to solve a deep neural network classification problem on the wine data set, using the DeepLearning4J machine learning library for Java.

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.deeplearning4j.datasets.iterator.impl.ListDataSetIterator;
import org.deeplearning4j.eval.Evaluation;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.optimize.listeners.ScoreIterationListener;
import org.deeplearning4j.util.ModelSerializer;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.SplitTestAndTrain;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.dataset.api.preprocessor.DataNormalization;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;
import org.nd4j.linalg.lossfunctions.LossFunctions;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WineClassifier {

    public static void main(String[] args) throws IOException, InterruptedException {
        // Define the input and output files
        File inputFile = new File("wine.csv");
        File outputFile = new File("wine_model.zip");

        // Describe the columns of the input data (the reader below relies on column order)
        Schema schema = new Schema.Builder()
                .addColumnInteger("Class")
                .addColumnsFloat("Alcohol", "Malic_acid", "Ash", "Alcalinity_of_ash",
                        "Magnesium", "Total_phenols", "Flavanoids", "Nonflavanoid_phenols",
                        "Proanthocyanins", "Color_intensity", "Hue",
                        "OD280/OD315_of_diluted_wines", "Proline")
                .build();

        // Create a record reader that skips the header row of the input CSV file
        RecordReader recordReader = new CSVRecordReader(1);
        recordReader.initialize(new FileSplit(inputFile));

        // Read all 178 rows in one batch. Column 0 is the class label;
        // this assumes the labels in wine.csv are encoded as 0, 1, and 2.
        DataSetIterator dataSetIterator = new RecordReaderDataSetIterator.Builder(recordReader, 178)
                .classification(0, 3)
                .build();
        DataSet dataSet = dataSetIterator.next();
        dataSet.shuffle();

        // Split the data into training (65%) and test (35%) sets
        SplitTestAndTrain testAndTrain = dataSet.splitTestAndTrain(0.65);
        DataSet trainingData = testAndTrain.getTrain();
        DataSet testData = testAndTrain.getTest();

        // Fit the normalizer on the training data, then apply it to both sets
        DataNormalization normalizer = new NormalizerStandardize();
        normalizer.fit(trainingData);
        normalizer.transform(trainingData);
        normalizer.transform(testData);

        // Create a list of listeners to track the training progress
        List<org.deeplearning4j.optimize.api.TrainingListener> listeners = new ArrayList<>();
        listeners.add(new ScoreIterationListener(10));

        // Configure the neural network
        int numInputs = 13;
        int numOutputs = 3;
        int numHiddenNodes = 10;
        MultiLayerConfiguration config = new NeuralNetConfiguration.Builder()
                .seed(123)
                .updater(new org.nd4j.linalg.learning.config.Adam())
                .list()
                .layer(0, new DenseLayer.Builder().nIn(numInputs).nOut(numHiddenNodes)
                        .activation(Activation.RELU)
                        .build())
                .layer(1, new DenseLayer.Builder().nIn(numHiddenNodes).nOut(numHiddenNodes)
                        .activation(Activation.RELU)
                        .build())
                .layer(2, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .activation(Activation.SOFTMAX)
                        .nIn(numHiddenNodes).nOut(numOutputs).build())
                .build();

        // Train the neural network
        MultiLayerNetwork model = new MultiLayerNetwork(config);
        model.setListeners(listeners);
        model.init();
        model.fit(new ListDataSetIterator(trainingData.asList()));

        // Evaluate the performance of the model on the test data
        Evaluation evaluation = new Evaluation(numOutputs);
        INDArray output = model.output(testData.getFeatures());
        evaluation.eval(testData.getLabels(), output);
        System.out.println(evaluation.stats());

        // Save the model to a file
        ModelSerializer.writeModel(model, outputFile, true);
    }
}

WineClassifier.java

That is how it is done in Java. Note the number of lines, the word count (well over 300 words), and the complexity of the process.

If there is a single error in a single character of those 300 words, the program won't compile or, worse, it will fail with a run-time error.


Now let’s do the same thing in Python, using scikit-learn, the most commonly used Python Machine Learning library:

import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Load the wine dataset
data = load_wine()
X = data['data']
y = data['target']

# Prepare the data: hold out 20% for testing, then standardize the features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Configure the neural network
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, alpha=0.01,
                    solver='adam', verbose=False, tol=1e-4, random_state=42,
                    learning_rate_init=0.01)

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Evaluate the performance of the model on the test data
accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {100 * accuracy:.2f}%")

wine_classifier.py

103 words. And simple to read. I rest my case.
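Part of that brevity comes from how much each library call does for you. As a rough illustration of what a standardizing step such as `StandardScaler` computes, the sketch below rescales one column of numbers to zero mean and unit variance using only the standard library. The function name `standardize` and the sample values are made up; this is not scikit-learn's implementation, only the same arithmetic on a toy input.

```python
import statistics


def standardize(column):
    """Rescale a list of numbers to zero mean and unit variance.

    Assumes the values are not all identical (the standard deviation
    would be zero and the division would fail).
    """
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)   # population standard deviation
    return [(x - mean) / std for x in column]


alcohol = [12.0, 13.0, 14.0]          # made-up values, not real wine data
scaled = standardize(alcohol)
print(scaled)
```

In the scikit-learn listing, the mean and standard deviation are learned from the training data only and then reused on the test data, which is why it calls `fit_transform` on one set and `transform` on the other.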

Statistics and Miscellaneous Information about Python

Now, that last example alone will have given you a deep insight into why Python is so often preferred over Java, C, C++, and the .NET ecosystem.

But is that the story for all applications of Python?

The answer is, definitely, yes.

We created a summary from the article, Awesome Statistics of Python in 2023, taken from the following link:

https://leftronic.com/blog/python-statistics/

  1. Python is very similar to English.
  2. Companies that use Python: Google, Instagram, Facebook, Quora, Spotify, Netflix, Reddit, NASA, IBM.
  3. 56% of Python developers work on their own projects independently.
  4. Four out of five developers claim that Python is their main language.
  5. 1.4% of all websites on the internet use Python as a server-side programming language.
  6. Python has been downloaded 23 million times for Windows alone.
  7. Python developers outnumber Java developers (over 10 million developers worldwide).
  8. A Python developer can earn as much as $118,000 a year.
  9. A Python senior or chief data scientist can earn $400,000+ a year.
  10. There are more than 70,000 Python jobs currently available.
  11. Python has greatly influenced the development of Java 5 to JDK 17 LTS.
  12. Python is one of the official languages used by Google.
  13. Python is an open-source language.
  14. In game development, Python is used mainly by hobbyists.
  15. There are 147,000 packages in the Python package repository (PyPI, the Python Package Index).
  16. They have been downloaded over 2 billion times in total.


Applications of Python

  1. Web Development: Python has several popular web frameworks such as Django, Flask, Pyramid, and CherryPy that make it easy to build dynamic and scalable web applications with clean, efficient code.
  2. Data Science: Python has become the preferred language for data science and analysis due to its vast array of libraries, including NumPy for numerical computations, Pandas for data manipulation, and Matplotlib for data visualization, making it an ideal choice for machine learning and artificial intelligence applications.
  3. Natural Language Processing: Python has several powerful NLP libraries such as NLTK, spaCy, and TextBlob that allow users to perform a wide range of natural language processing tasks, such as sentiment analysis, language detection, and text classification.
  4. Computer Vision: OpenCV, through its Python bindings, is a popular computer vision library that provides easy-to-use tools for image processing and analysis, making it ideal for applications such as facial recognition, object detection, and image segmentation.
  5. Automation: Python can be used for automating a wide range of tasks, from web scraping and data collection to network automation and system administration, with popular libraries such as Selenium, Beautiful Soup, and PyAutoGUI.
  6. Internet of Things: Python is becoming increasingly popular for IoT applications, with MicroPython and CircuitPython running on hardware such as Adafruit boards, Pycom boards, and the pyboard, providing tools for IoT development, data collection, and analysis.
  7. Audio and Music: Python has several libraries such as LibROSA and PyDub, along with wrappers around the FFmpeg tool, that enable users to work with audio files, perform signal processing, and even create music.
  8. Finance: Python is widely used in finance for tasks such as quantitative analysis, risk management, and algorithmic trading, with popular libraries such as Pandas, NumPy, and QuantLib.
  9. Image and Video Editing: Python has several libraries such as Pillow and OpenCV that provide tools for image and video editing, such as resizing, cropping, filtering, and color correction.
  10. Cryptography and Security: Python has several libraries such as PyCryptodome (the maintained successor to PyCrypto), cryptography, and Paramiko that enable users to perform cryptography tasks, create secure connections, and develop security tools.
  11. Scientific Computing: Python has several libraries such as SciPy, SymPy, and Pyomo that provide tools for scientific computing, optimization, and modeling.
  12. Education: Python is widely used in education as an introductory programming language due to its simplicity and ease of use, with popular teaching aids such as turtle graphics and learning platforms such as Codecademy.
  13. Testing and Quality Assurance: Python has several libraries such as pytest, nose, and Robot Framework that enable users to perform automated testing and quality assurance tasks.
  14. Geographic Information Systems (GIS): Python has several libraries such as GeoPandas, Shapely, and Fiona that provide tools for working with geographic data, such as mapping, spatial analysis, and geocoding.
  15. Marketing and Advertising: Python can be used for marketing and advertising tasks such as web scraping, data analysis, and automation, with libraries such as Selenium, Beautiful Soup, and Pandas.
  16. E-commerce: Python can be used for e-commerce tasks such as building online stores, payment processing, and order management, with popular frameworks such as Django and Flask.
  17. Chatbots: Python has several libraries such as ChatterBot, Rasa, and BotBuilder that enable users to create chatbots and conversational agents for a wide range of applications.
  18. Social Media: Python can be used for social media tasks such as data analysis, automation, and sentiment analysis, with libraries such as Tweepy, PySocialWatcher, and TextBlob.
  19. Big Data: Python has several libraries such as PySpark, Dask, and Vaex that enable users to work with large datasets and perform distributed computing, making it an ideal language for big data applications.
  20. Cybersecurity: Python is widely used in cybersecurity for tasks such as network scanning, intrusion detection, and malware analysis, with popular libraries such as Scapy and pypcap.
  21. Human Resources: Python can be used for tasks such as data analysis, automation, and employee management, with libraries such as Pandas, NumPy, and Tkinter.
  22. Energy and Utilities: Python can be used for tasks such as energy forecasting, smart grid management, and demand response optimization, with libraries such as Pyomo, Pandas, and Scikit-learn.
  23. Healthcare: Python is increasingly being used in healthcare for tasks such as medical image analysis, clinical decision support, and disease prediction, with libraries such as PyRadiomics, TensorFlow, and Scikit-learn.
  24. Agriculture: Python can be used for tasks such as crop yield prediction, soil analysis, and precision agriculture, with libraries such as Pandas, NumPy, and GeoPandas.
  25. Quality Control: Python can be used for quality control tasks such as statistical process control, defect analysis, and root cause analysis, with libraries such as SciPy, NumPy, and Pandas.
  26. Aerospace and Defense: Python can be used for tasks such as aircraft design, space mission planning, and missile guidance, with libraries such as PyDSTool, SymPy, and NumPy.
  27. Social Sciences: Python can be used for social science tasks such as data analysis, text mining, and network analysis, with libraries such as NetworkX, NLTK, and Scikit-learn.
  28. Transportation: Python can be used for transportation tasks such as route optimization, traffic prediction, and autonomous vehicle control, with libraries such as Pandas, NumPy, and TensorFlow.
  29. Sports: Python can be used for sports analytics, such as player performance analysis, team strategy optimization, and sports betting, with libraries such as PySport, NumPy, and Pandas.
  30. ChatGPT and LLMs: ChatGPT is an example of a Generative Pre-trained Transformer. Systems of this kind are typically built with Python at the high level, while the lower levels are optimized for speed with C++. Most LLMs follow the same structure.

...

I could go to over 100 applications, and I still wouldn’t be finished.

Python is simple, easy to learn, easy to work with, and has one of the largest collections of libraries of any language, built by commercial communities, open-source communities, and in some cases individuals (e.g., Keras and François Chollet from Google).


Python and Data Science

So now you have a pretty good idea of why Python is so popular. How did it become the de facto language for data science?

Python has become the de facto language for data science because it offers a range of features that make it an ideal choice for data science tasks.

One of the main reasons for this is its ease of use and readability. Python has a simple syntax that is easy to learn and understand, making it accessible to programmers of all levels of experience.

Python's open-source nature has also contributed to its popularity.

Being open-source, Python can be easily extended with libraries and tools developed by other developers.

These libraries and tools make it easier to accomplish complex tasks.

Even data analysis, visualization, and machine learning can be done without having to write code from scratch.

Python's libraries for data science, such as NumPy, Pandas, and Matplotlib, are some of the most widely used libraries in the field.
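Even the standard library alone gives a taste of this. The sketch below reads a tiny, made-up CSV table and summarizes one column without any hand-written parsing or arithmetic. The column names and values are invented for illustration; a library such as Pandas offers far richer versions of the same idea.

```python
import csv
import io
import statistics

# Made-up sample data, standing in for a CSV file on disk.
raw = io.StringIO(
    "city,temperature\n"
    "Oslo,4.0\n"
    "Cairo,30.0\n"
    "Lima,20.0\n"
)

# csv.DictReader turns each row into a dict keyed by the header names.
rows = list(csv.DictReader(raw))
temperatures = [float(row["temperature"]) for row in rows]

summary = {
    "count": len(temperatures),
    "mean": statistics.mean(temperatures),
    "median": statistics.median(temperatures),
}
print(summary)
```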

Python's interoperability with other languages and tools is also a major factor in its popularity in data science.

Python can be easily integrated with other tools and platforms used in data science, such as Hadoop and Spark.

It can be used in conjunction with other popular data science languages, such as R and Julia.

The Jupyter Notebook and the JupyterLab app are good examples of environments that support Julia, Python, and R, with each notebook running on the language kernel of your choice.

Top 10 Best Certifications to learn Python for Data Science

  1. Professional Certificate in Python Data Science – (Course) edX

There is absolutely no doubt about this. edX is one of the world’s best remote learning platforms, and a professional certificate in data science could do more for your chances of getting a job as a data engineer or a data scientist than even a CS degree costing well over six figures.

  2. Programming for Data Science with Python – (Course) Udacity

This is a course from Udacity, another prestigious course content creator, and the content is optimized for beginners. A good course to start with.

  3. Statistics Fundamentals with Python – (Specialization) DataCamp

If you don’t know statistics, you don’t know data science. Data science has statistics as a cornerstone. All of 19 hours and that too, for free!

  4. Data Analyst with Python – (Learning Path) DataCamp

66 hours and 19 courses of free learning! If you want to upskill with quality content for free don’t look anywhere else – this is perfect!

  5. IBM AI Engineering Professional Certificate – (Certification) Coursera

To put it simply, in the market of online courses, MasterTracks, Specializations and Certificates, this is the number one course you can take. This provides all the information and the expertise necessary to begin an enriching career as a data scientist.

  6. Tableau Fundamentals – (Learning Path) DataCamp

You may wonder, in an article about Python and Data Science, why are we adding a course on Tableau? Well, as good as matplotlib, bokeh, and seaborn may be, to present your points to the user you need to know how to create an interactive dashboard, for which this is the best tool.

  7. Python Programmer – (Learning Path) DataCamp

This is the number one Learning Path that I would recommend to a beginner who is brand new to the field. In over 59 hours of 15 courses, you will literally go from Python zero to hero in your data science journey.

  8. Introduction to Data Science with Python – (Specialization) Coursera

Of course, no list on Python and data science would be complete without the most famous Python and Data Science specialization of all, the one offered by the University of Michigan. By all means, and especially if you are a beginner, sign up for this set of courses.

  9. IBM Data Science Professional Certificate – (Certification) Coursera

This is a certification from IBM and holds a very high value. The theory given here is especially useful, since they cover all the advanced topics you need to know to be a successful data scientist. And Python is included as well!

  10. Data Science on Microsoft Azure with Python Programming – (Certification) Future Learn

Again, you may ask, why cloud computing? The answer is that you cannot compete in the IT field on the world stage if you don’t know at least one platform for cloud deployment of your Python code. This is a fantastic certification and very well put together.


Future Outlook

Python’s popularity in every field will only increase as time goes by.

Google has over 2 billion lines of code in its monolithic codebase, which makes it the largest in the world in terms of LOC.

Python has a significant share in that codebase.

Python is simple to learn.

Also, easy to read and understand.

With virtual environments (the built-in venv module or the virtualenv tool), you can keep each project's packages, and the Python version it runs on, separate from every other project's.
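Most people create an environment from the command line with `python -m venv`, but the same standard-library module can be driven from code. The sketch below creates a throwaway environment in a temporary directory and checks that it was set up. The directory name `demo-env` is hypothetical, and pip is skipped only to keep the example quick.

```python
import tempfile
import venv
from pathlib import Path

# Create a throwaway environment inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "demo-env"        # hypothetical name for this sketch
    venv.create(env_dir, with_pip=False)    # skip pip to keep the example fast
    # Every environment gets a pyvenv.cfg file describing its base interpreter.
    config_exists = (env_dir / "pyvenv.cfg").is_file()

print(config_exists)
```

In a real project you would keep the environment directory around and install the project's dependencies into it.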

Docker is written in Go but is used extensively with Python in containerized projects.

One of the largest open-source communities in the world belongs to Python.

Python has applications for blockchain, Web 3.0, and quantum computing. Which I did not even mention!

And definitely, data science.

At least for the next 10 years.

Although the way technology is progressing, we could see quantum leaps in technology very soon!

And undoubtedly, the best Python programmer in the world is ChatGPT.

Let that sink in for a minute.

After years and years, we are entering what was the domain of science fiction – software building software.

And for all of that – Python will be your silver bullet.

Learn Python. As well as you can.

It could be the beginning of a six-figure salary in your career.

In US dollars.

All the very best of luck to you.
