Comfortable Data Science (Part I)

A comfy Data Scientist

Anyone who has seriously dabbled in Data Science, Machine Learning or Data Analytics in Python understands that the work can be exciting, frustrating and difficult, but ultimately rewarding. However, I have never heard anyone say that the process was comfortable. We seek comfort in many aspects of life; we invest in comfortable couches, snug blankets, and the list goes on. Is it possible to feel like we’re wrapped in a snug blanket while doing DS in Python? What would that even feel like?

To me, comfort implies that our common pain points and repetitive tasks are well managed. In my experience, here are the annoying bits of DS in Python:

  • Environment & Package Management (Part I)
  • Reliable and meaningful code editor (Part I)
  • Model versioning and comparison (Part II — forthcoming)
  • Task Scheduling and Automation (Part II — forthcoming)
  • Slow or inefficient code execution (Part III — forthcoming)

While some of these pain points are not specific to DS in Python, they all certainly infuriate us from time to time, sometimes every day. Laptops have been smashed and broken for much less.

In this three-part blog series, I will detail and showcase my approach to each of the mentioned pain points. Today, I will discuss two nuisances that appear at the beginning of the DS journey: package management and code editors.

Environment and Package Management

At the outset I would like to submit that I am not a fan of Anaconda, and will not be demonstrating any Anaconda workflows in this blog series. I prefer to use pip and virtualenv as this offers me more control over my environment. The only time I reluctantly use Miniconda is when a package has a long list of difficult dependencies and simply cannot be installed via pip.

So how do you start on your shiny new project? First, create a virtual environment. Always give the environment a meaningful name so you can find it later. Pro tip: virtual environments are just folders, so I keep all of mine in a single parent folder to keep track of them.

Setting up and activating a virtual environment
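For readers who would rather script this step, here is a minimal sketch using Python's standard-library venv module. Note that this is a stand-in for the virtualenv command I use, not the same tool, and that the helper names and the parent folder in the usage note are my own illustrative assumptions.

```python
import venv
from pathlib import Path


def create_environment(name, envs_root, with_pip=True):
    """Create a virtual environment called `name` inside `envs_root`."""
    env_dir = Path(envs_root).expanduser() / name
    # EnvBuilder is the standard-library way to build an environment;
    # with_pip=True also bootstraps pip into the new environment.
    builder = venv.EnvBuilder(with_pip=with_pip)
    builder.create(env_dir)
    return env_dir


def activation_hint(env_dir):
    """Return the shell command that activates the environment on Linux/macOS."""
    return f"source {Path(env_dir) / 'bin' / 'activate'}"
```

Calling create_environment("comfortable_ds", "~/envs") would build the environment under a hypothetical ~/envs folder, and activation_hint returns the command to paste into the terminal afterwards.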

After creating the virtual environment, we need to install our dependencies. Write your requirements.txt file (I use nano for this). Here we can pin the specific package versions we want; for any package listed without a version, pip installs the latest available release.

requirements.txt file in nano

Finally, install the packages with the command pip install -r requirements.txt.
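To make the pinned-versus-unpinned distinction concrete, here is a small sketch that writes a requirements.txt and then hands it to pip. The package list and the pinned version are illustrative assumptions (only ipykernel comes from this post), and the helper names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Illustrative contents: one pinned package, the others left unpinned so that
# pip picks the latest available release for them.
REQUIREMENTS = """\
pandas==1.5.3
scikit-learn
ipykernel
"""


def write_requirements(project_dir, contents=REQUIREMENTS):
    """Write requirements.txt into `project_dir` and return its path."""
    path = Path(project_dir) / "requirements.txt"
    path.write_text(contents)
    return path


def install_requirements(requirements_path):
    """Run `pip install -r` with the interpreter that executes this script."""
    # Going through sys.executable installs into the currently active
    # environment rather than whichever pip happens to be first on PATH.
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r", str(requirements_path)],
        check=True,
    )
```

Run from inside the activated environment, install_requirements(write_requirements(".")) has the same effect as the pip command above.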

Since all of us here are Jupyter Notebook junkies, the next logical step is to make the environment available in Jupyter. We will use the ipykernel package (which should be in your requirements.txt) for this, running the following with the environment activated: ipython kernel install --user --name=comfortable_ds

Once you open a Jupyter notebook, you will notice a kernel with the project name in the kernel list.

Jupyter Notebook with custom kernel
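If you want to double-check that the notebook really is running on the environment's interpreter, a quick sanity check like the one below can be pasted into a cell. This is a sketch of my own rather than part of the setup; the function name is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.prefix differ from sys.base_prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

With the comfortable_ds kernel selected, the interpreter path should point inside the environment's folder.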

Code Editor

Now, while Jupyter notebooks are a decent environment for quick experiments, they are poorly suited to writing production-grade code. First and foremost, notebooks are difficult to version control. While some workarounds like nbdiff exist, there just isn’t anything like good ol’ git. However, it’s impossible for us Data Scientists to simply throw away Jupyter, so is there a common environment that offers both Jupyter and a smart text editor? Enter VS Code.

Visual Studio Code (VS Code) is one of the most popular code editors in the world, if not the most popular, with git integration, an integrated terminal, easy-to-use Python linters and loads of other customization. VS Code can really make life comfortable. Today I am going to discuss the benefits most relevant to a Data Scientist/Machine Learning Engineer, namely:

  • Git Integration
  • Python Linter
  • Integrated Jupyter Notebooks
  • Remote Server Access (this one makes me really comfortable)

Most of the lengthy code we write should be versioned with git. I’ve burned my fingers one too many times by losing bits of code here and there, and it can be positively annoying. We don’t need services like GitHub or GitLab to version our code; a local git repository is enough for versioning and committing. VS Code ships with basic git functionality, and with a few keystrokes we can start gitting away (please ensure you have git installed on your system).

VS Code with Git, Terminal and Editor

Using the in-built terminal, I simply initialize a git repository with git init. To begin working on the repository, I need to select the git symbol on the left-hand side and open the desired folder (in this case, comfortable_ds).

To stage and commit any files I write in this repository, I can simply use the given buttons instead of writing git add etc. every time.

Using Git in VS Code

By first staging the changes, and then clicking on the tick-mark symbol (highlighted in a red box), voila! I have committed my code.

In-built Python Linter

I won’t spend too much time here, but if you don’t use linters yet, do give them a shot. Many of the annoying bugs in our code (a typo in a variable name, calling a function we have not imported yet, etc.) can be caught easily with a linter.

Using Python Linter: Run -> Start Debugging (F5)

Running the file this way (strictly speaking, this is VS Code’s debugger at work rather than the linter) stops at the first error it hits as per the control flow. Here, as you can see, when I call the subtractor function, a NameError is caught.
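Since the screenshot does not show the code itself, here is a hypothetical reconstruction of the kind of bug in question: a function called subtractor that was never defined or imported. Everything except the subtractor name is my own illustration.

```python
def adder(a, b):
    """Return the sum of two numbers."""
    return a + b


def run_calculations():
    total = adder(5, 3)
    # `subtractor` is never defined or imported, so Python raises NameError
    # as soon as execution reaches this line.
    difference = subtractor(5, 3)
    return total, difference


try:
    run_calculations()
except NameError as error:
    caught_message = str(error)
    print("NameError:", caught_message)
```

The try/except is only there so the snippet runs to completion; without it, the debugger would pause on the subtractor line, and a linter would typically underline the undefined name before the code is ever run.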

For longer pieces of code, if we only want to debug a certain section, we can use breakpoints (the red dots below) to mark which sections of code are of interest.

Integrated Jupyter Notebooks

Using Jupyter Notebooks in VS Code is very easy. All we need to do is open and save a new file with the .ipynb extension. That’s it!

Jupyter Notebook support in VS Code

Once we have the notebook open, we can choose which kernel we would like at the top right of the page. Here I am using the same kernel we set up earlier (comfortable_ds). However, if you want to use another virtual environment, click on the kernel button and choose the desired virtual environment’s directory path from the drop-down.

Remote SSH

I’m sure some people reading this article are thinking, “Psh.. Most of these things can be done in JupyterLab, why do I need to use VS Code?” To you, I offer my final trump card.

For those of us who need to experiment and test code on large remote servers because of the sheer size of the data, things get annoying quickly: we keep going back and forth between our local environment, where we would like to edit our code, and the remote server, where we want to run it. We either have to constantly git push and pull code or, for those brave souls, use vim to write code directly on the server. Frightening. Wouldn’t it be ever so comfortable to be able to write and run code remotely in VS Code? It would, and it is.

To set this up, we first need to provide our instance details. Since I am comfortable with AWS, I will be using some AWS terminology. First, we click on the orange button on the bottom left corner, then choose “Connect to Host”.

Starting Remote-SSH on VS Code

Then we select “Configure SSH Host”. Here we need a config file holding our instance details: the instance name, its address and the location of the .pem file.

Config file for Remote-SSH
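As a rough idea of what such an entry looks like, the sketch below renders one in the standard OpenSSH config format. Every value is a placeholder assumption: the host alias, the address (taken from a documentation-only IP range), the key path, and the ec2-user login, which is only a common default on some AWS images and is not mentioned above.

```python
def render_ssh_config(host_alias, host_name, user, identity_file):
    """Render a single OpenSSH config entry as text."""
    lines = [
        f"Host {host_alias}",                  # the name shown in VS Code's host list
        f"    HostName {host_name}",           # public DNS name or IP of the instance
        f"    User {user}",                    # login user on the server
        f"    IdentityFile {identity_file}",   # location of the .pem key
    ]
    return "\n".join(lines) + "\n"


# All values below are hypothetical placeholders.
entry = render_ssh_config(
    host_alias="comfortable-ds-server",
    host_name="203.0.113.10",
    user="ec2-user",
    identity_file="~/.ssh/comfortable_ds.pem",
)
print(entry)
```

The printed text is what would go into the config file, with your own instance details swapped in.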

Once we have this file, we need to specify the path to the config file under “Remote.SSH: Config file”.

Now we can SSH in using “Connect to Host”. VS Code will try to connect to the server, and once it succeeds the files on the server should be visible, something like this:

Example view of VS Code workspace from remote server

For a more detailed setup, please checkout

Truly comfortable

With our virtual environments ready to go and an editor that is more customizable than our own homes, we are truly comfortable. At least I am.

Once we are nice and cozy, we can go ahead and build our data pipelines, models etc. But even building and storing models can be very repetitive and annoying, and some of us could really do with more automation and scheduling. Stay Tuned!

Data Scientist at Wipro Digital. GIS enthusiast.