Anyone who has seriously dabbled with Data Science, Machine Learning or Data Analytics in Python understands that the work process can be exciting, frustrating, difficult but ultimately rewarding. However, I have never heard anyone say that the process was comfortable. We seek comfort in many aspects of life; we invest in comfortable couches, snug blankets and the list goes on. Is it possible to feel like we’re wrapped in a snug blanket while doing DS in Python? What would that even feel like?
To me, comfort implies that our common pain points and repetitive tasks are well managed. In my experience, here are the annoying bits of DS in Python:
- Environment & Package Management (Part I)
- Reliable and meaningful code editor (Part I)
- Model versioning and comparison (Part II — forthcoming)
- Task Scheduling and Automation (Part II — forthcoming)
- Slow or inefficient code execution (Part III — forthcoming)
While some of these pain points are not specific to DS in Python, they all certainly infuriate us from time to time, sometimes every day. Laptops have been smashed for much less.
In this three-part blog series, I will detail and showcase my approach to each of the mentioned pain points. Today, I will discuss two nuisances that appear at the beginning of the DS journey: package management and code editors.
Environment and Package Management
At the outset I would like to submit that I am not a fan of Anaconda, and will not be demonstrating any Anaconda workflows in this blog series. I prefer to use pip and virtualenv as this offers me more control over my environment. The only time I reluctantly use Miniconda is when a package has a long list of difficult dependencies and simply cannot be installed via pip.
So how do you start on your shiny new project? First, create a virtual environment. Remember to give the environment a meaningful name so you can find it later. Pro tip: virtual environments are just folders, so to keep track of them I keep all of mine in a single directory.
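As a sketch, here is how that setup might look in the terminal, assuming a ~/venvs folder and the hypothetical project name comfortable_ds (I use the stdlib venv module here; the virtualenv package works the same way):

```shell
# Keep every environment in one folder so they are easy to find later.
mkdir -p ~/venvs

# Give the environment a meaningful, searchable name.
python3 -m venv ~/venvs/comfortable_ds

# Activate it before installing anything into it.
source ~/venvs/comfortable_ds/bin/activate

# 'python' now points inside the environment.
which python
```
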
After creating the virtual environment, we need to install our required dependencies. Write your requirements.txt file (I use nano for this). Here we can pin specific package versions; otherwise, the latest versions will be installed.
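A minimal requirements.txt for this kind of project might look like the following; the package choices and version pins are purely illustrative:

```text
pandas==2.0.3       # pinned to a specific version
scikit-learn>=1.3   # any version at or above 1.3
matplotlib          # unpinned: the latest version will be installed
ipykernel           # needed shortly to register the Jupyter kernel
```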
Finally, install the packages with the command pip install -r requirements.txt.
Since all of us here are Jupyter Notebook junkies, the next logical step is to ensure that we can use the environment in Jupyter. We will use the ipykernel package (which should be in your requirements.txt) to complete this: ipython kernel install --user --name=comfortable_ds
Once you open a Jupyter notebook, you will see a kernel with the project name listed.
Now, while Jupyter notebooks are a decent environment for quick experiments, they cannot be used to write production-grade code. First and foremost, it is difficult to version control in Jupyter. While some workarounds exist, like nbdiff, there just isn’t anything like good ol’ git. However, it’s impossible for us Data Scientists to simply throw away Jupyter, so is there a common environment that offers both Jupyter and a smart text editor? Enter VS Code.
Visual Studio Code (VS Code) is one of the most popular code editors in the world, if not the most popular, with git integration, an integrated terminal, easy-to-use Python linters and loads of other customization options. VS Code can really make life comfortable. Today I am going to discuss the benefits relevant to a Data Scientist/Machine Learning Engineer, namely:
- Git Integration
- Python Linter
- Integrated Jupyter Notebooks
- Remote Server Access (this one makes me really comfortable)
Most of the lengthy code we write should be versioned with git. I’ve burned my fingers one too many times losing bits of code here and there; it can be positively maddening. We don’t require services like GitHub or GitLab to version our code; we can use local git repositories for versioning and committing. VS Code ships with basic git functionality, and with a few keystrokes we can start gitting away (please ensure you have git installed on your system).
Using the in-built terminal, I simply initialize a git repository with git init. To begin working on the repository, I need to select the git symbol on the left-hand side and open the desired folder (in this case, comfortable_ds).
To stage and commit any files I write in this repository, I can simply use the given buttons instead of writing git add etc. every time.
After staging the changes and clicking on the tick-mark symbol (highlighted in a red box), voila! I have committed my code.
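For reference, the same stage-and-commit flow looks like this in the integrated terminal; the folder name, file and identity below are placeholders:

```shell
# A hypothetical project folder; substitute your own.
mkdir -p comfortable_ds && cd comfortable_ds

# Initialize a local repository; no GitHub or GitLab needed.
git init

# Set your identity once, if you haven't already.
git config user.name "Your Name"
git config user.email "you@example.com"

# Create a file, stage it (the "+" button), and commit it (the tick-mark).
echo "print('comfortable DS')" > pipeline.py
git add pipeline.py
git commit -m "Initial commit"

# The commit now shows up in the history.
git log --oneline
```
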
In-built Python Linter
Won’t spend too much time here but if you don’t use linters yet, do give them a shot. All the annoying bugs in our code (a typo in our variable names, calling functions we have not imported yet etc. ) can be easily debugged with a linter.
The linter will highlight the first bug it catches as per the control flow. Here, as you can see, when I call the subtractor function, a NameError is caught.
For longer pieces of code, if we only want to debug a certain section, we can use breakpoints (the red dots below) to mark which sections of code are of interest.
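To see what the linter is saving us from, here is a small script in the spirit of the example above; subtractor is called but never defined, which a linter flags immediately but which otherwise only surfaces at runtime:

```shell
cat > demo.py <<'EOF'
def adder(a, b):
    return a + b

print(adder(1, 2))
print(subtractor(5, 3))  # bug: 'subtractor' was never defined
EOF

# Without a linter, the mistake only appears when the script runs.
python3 demo.py 2>&1 | grep NameError
```
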
Integrated Jupyter Notebooks
Using Jupyter Notebooks in VS Code is very easy. All we need to do is open and save a new file with the .ipynb extension. That’s it!
Once we have the notebook open, we can choose the kernel we would like from the top right of the page. Here I am using the same kernel we set up earlier (comfortable_ds). However, if you want to use another virtual environment, click on the kernel button and choose the desired virtual environment’s directory path from the drop-down.
I’m sure some people reading this article are thinking, “Psh… most of these things can be done in JupyterLab, why do I need to use VS Code?” To you, I offer my final trump card.
For those of us who need to experiment and test code on large remote servers because of the sheer size of the data, a familiar frustration arises: going back and forth between our local environment, where we would like to edit our code, and the remote server, where we want to run it. We either have to constantly git push and pull code or, for the brave souls among us, use vim to write code directly on the server. Frightening. Wouldn’t it be ever so comfortable to write and run code remotely in VS Code? It would, and it is.
To set this up, we first need to provide our instance details. Since I am comfortable with AWS, I will be using some AWS terminology. First, we click on the orange button in the bottom-left corner, then choose “Connect to Host”.
Then we select “Configure SSH Host”. Here we need a config file with our instance details: the instance name, its address and the location of the .pem file.
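A minimal sketch of such a config entry; the host alias, address, user and key path below are all placeholders, not real values:

```text
Host comfortable_ds_server
    HostName ec2-12-34-56-78.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/my-key.pem
```

The Host alias is the name that should then appear when you connect from VS Code.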
Once we have this file, we need to specify the path to the config file under “Remote.SSH: Config file”.
Now we can connect using “Connect to Host”; VS Code will try SSHing into the server, and the files on the server should become visible, something like this.
For a more detailed setup, please check out https://medium.com/@christyjacob4/using-vscode-remotely-on-an-ec2-instance-7822c4032cff.
With our virtual environments ready to go and an editor more customizable than our own homes, we are truly comfortable. At least, I am.
Once we are nice and cozy, we can go ahead and build our data pipelines, models, etc. But even building and storing models can become repetitive and annoying, and some of us could really do with more automation and scheduling. Stay tuned!