Along your journey into Data Science & AI, you’ve probably heard people talking about ‘the cloud’ and platforms like AWS, GCP, and Azure, but why should an AI specialist take any notice? The modern data science environment works with the cloud in a number of ways: in many organisations, the primary workflow for both development and deployment is based solely in the cloud, while others still develop on local machines and deploy to a mix of on-premise and cloud environments (usually called hybrid deployment). So for someone starting in data science, what should you focus on learning?
There are a number of key things to consider when evaluating whether to incorporate the cloud into your workflow; aspects such as scale, data restrictions, and the ability to collaborate all affect the way you work. Let’s dive into some of these aspects and look at why they matter.
Why do I need scale?
First of all, scale is the ability of the system you are working on to deal with increasing compute requirements, which matters when running large or dense models. The main reason both individual data scientists and companies adopt cloud-based workflows is the ability to scale far beyond what would be possible with on-premise systems. Say, for example, you were training a large neural network, and getting the results you need would take around 10 hours on a machine with 64 cores, 488 GB of RAM, and 8 GPUs. A machine like this would cost you tens of thousands of dollars to buy and run locally, and you would also need the physical space for it, as machines like this have quite a sizeable footprint. On the cloud, you can access a machine of that size in a few clicks, use it for the 10 hours to train your model, and then shut it down again. A machine of that specification would cost you around $250 USD for those 10 hours, making the training of larger models achievable for data scientists without access to large local machines (and big piles of cash!).
This example shows that, depending on your goals for the model you are producing, you may need massive compute (or simply more compute than you have locally), and the cloud offers a quick, affordable way to access it the moment you need it.
So when should I use on-premise?
Using the previous example (admittedly quite an extreme one!), if you needed to run that training job every day to retrain the model, it would cost you upwards of $90,000 USD a year. Now you can see where owning that machine locally may be more cost-effective when you need it for a fixed, continuously compute-heavy task.
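The arithmetic behind this trade-off can be sketched in a few lines. Note that the hourly rate and the machine purchase price below are illustrative assumptions drawn from the rough figures in this article, not real vendor quotes:

```python
import math

# Illustrative assumptions, not real vendor prices:
CLOUD_RATE_PER_HOUR = 25.0     # ~$250 USD for a 10-hour training run
HOURS_PER_RUN = 10
LOCAL_MACHINE_COST = 90_000.0  # assumed purchase price of a comparable on-premise box

def cloud_cost(runs: int) -> float:
    """Total cloud spend for a given number of training runs."""
    return runs * HOURS_PER_RUN * CLOUD_RATE_PER_HOUR

def break_even_runs() -> int:
    """Number of runs after which buying the machine becomes cheaper than renting."""
    return math.ceil(LOCAL_MACHINE_COST / (HOURS_PER_RUN * CLOUD_RATE_PER_HOUR))

print(cloud_cost(365))    # daily retraining for a year: 91250.0
print(break_even_runs())  # 360
```

Under these assumed numbers, the break-even point arrives in under a year of daily retraining, which is exactly the kind of sustained workload where on-premise hardware starts to pay for itself.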
Another important consideration is the type of data you are using to feed your model. Some questions that will help you decide whether it can go to the cloud are:
- Does this data have any access restrictions?
- Is this data permitted for use outside of my local environment?
- Is my company’s cloud secured for this type of data?
Data scientists are often provided with CSVs and other file-based formats for work on smaller data sets, but just because the data has been handed to you in a file doesn’t mean your organisation has approved it for use outside of your protected local machine. The best thing to do is err on the side of caution and always ask your database administrators or security team whether the data is approved for use in the cloud. If you are working on a personal project, the security measures put in place by the cloud providers should be more than adequate, as long as you secure access with a private key.
What about collaboration?
There are many ways to collaborate in the realm of data science: you can share finished code with source control (GitHub etc.), or you can pair program with other data scientists using shared notebooks (Google Colab etc.). Whichever way you choose to share, these tools generally include the ability to run code on the cloud and share it with others. Considering which tools you learn matters as you begin to collaborate in this space; for example, sites like Kaggle let you run your code on the website against their data sets, powered by cloud compute.
In summary, using the cloud has a number of advantages: it lets you work quickly at scale and accelerate your training. However, there are several things to be aware of when moving data to the cloud and when picking the right type of infrastructure for each task. In the next article in this series, we will look at where AI specialists should start in understanding the cloud and how you can begin building your cloud experience.
Check out the next article in this series:
And check out the previous series on the various ways to transition into data science:
- Transitioning and Changing Careers – Getting into Data Science & AI
- University and Formal Study – Getting into Data Science & AI
- Courses, Bootcamps, and Self Study – Getting into Data Science & AI
Jeremiah is a Director at PwC leading a Data Advisory team and founder of the AI Specialist Blog. He received the ACS ICT Professional of the Year award (2019) and was named among the Top 25 Analytics Professionals in Australia (2021, 2018). He has written articles for the AFR, IBM, and LearnDataSci.
Please get in touch if your business needs any help in the Data Science & Strategic advisory space!