What is the Best Programming Language for Machine Learning?
Q&A sites and data science forums are buzzing with the same questions over and over again: I’m new to data science, what language should I learn? What’s the best language for machine learning?
Firstly, my preference is to go with enterprise strength machine learning code generating software - if one has access to it. There are serious advantages of doing it this way, starting with speed, convenience, efficiency. Once user reaches limitation of interface - then it makes sense to continue programmatic route. And at that point - question is - which language to use?
There’s an abundance of articles attempting to answer these questions, either based on personal experience or based on job offer data. We turned instead to our hard data from 2,000+ data scientists and machine learning developers who responded to our latest survey about which languages they use and what projects they’re working on – along with many other interesting things about their machine learning activities and training. Then, being data scientists ourselves, we couldn’t help but run a few models to see which are the most important factors that are correlated to language selection. We compared the top-5 languages and the results prove that there is no simple answer to the “which language?” question. It depends on what you’re trying to build, what your background is and why you got involved in machine learning in the first place.
Which machine learning language is the most popular overall?
First, let’s look at the overall popularity of machine learning languages. Python leads the pack, with 57% of data scientists and machine learning developers using it and 33% prioritising it for development. Little wonder, given all the evolution in the deep learning Python frameworks over the past two years, including the release of TensorFlow and a wide selection of other libraries.
Python is often compared to R, but they are nowhere near comparable regarding popularity: R comes fourth in overall usage (31%) and fifth in prioritisation (5%). R is, in fact, the language with the lowest prioritisation-to-usage ratio among the five, with only 17% of developers who use it prioritising it. This means that in most cases R is a complementary language, not a first choice. The same ratio for Python is at 58%, the highest by far among the five languages, a clear indication that the usage trends of Python are the exact opposite to those of R. Not only is Python the most widely used language, it is also the primary choice for the majority of its users.
Python is prioritised in applications where Java is not
Our data reveals that the most decisive factor when selecting a language for machine learning is the type of project you’ll be working on – your application area. In our survey, we asked developers about 17 different application areas while also providing our respondents with the opportunity to tell us that they’re still exploring options, not actively working on any area. Here we present the top and bottom three areas per language: the ones where developers prioritise each language the most and the least.
Artificial Intelligence (AI) in games (29%) and robot locomotion (27%) are the two areas where C/C++ is favoured the most, given the level of control, high performance and efficiency required. Here a lower level programming language such as C/C++ that comes with highly sophisticated AI libraries is a natural choice, while R, designed for statistical analysis and visualisations, is deemed mostly irrelevant. AI in games (3%) and robot locomotion(1%) are the two areas where R is prioritised the least, followed by speech recognition where the case is similar.
Professional background is pivotal in selecting a machine learning language
Second to the application area, the professional background is also pivotal in selecting a machine learning language: the developers prioritising the top-five languages more than others come from five different backgrounds. Python is prioritised the most by those for whom data science is the first profession or field of study (38%). This indicates that Python has by now become an integral part of data science – it has evolved into the native language of data scientists. The same can not be said for R, which is mostly prioritised by data analysts and statisticians (14%), as the language was initially created for them, replacing S.
For Java, it’s the front-end desktop application developers who prioritise it more than others (21%), which is also in line with its use mostly in enterprise-focused applications as noted earlier. Enterprise developers tend to use Java in all projects, including machine learning. The company directive, in this case, is also evident from the third factor that is strongly correlated to language prioritisation – the reason to get into machine learning. Java is prioritised the most (27%) by developers who got into machine learning because their boss or company asked them to. It is the least preferred (14%) by those who got into the field just because they were curious to see what all the fuss was about – Java is not a language that you normally learn just for fun! It is Python that the curious prioritise more than others (38%), another indication that Python is recognised as the main language that one needs to experiment with to find out what machine learning is all about.
It seems that some universities teaching data science courses, still need to catch up with this notion though. Developers who say that they got into machine learning because data science is/was part of their university degree are the least likely to prioritise Python (26%) and the most likely to prioritise R (7%) as compared to others. There is evidently still a favourable bias towards R within statistics circles in academia – where it was born – but as data science and machine learning gravitate more towards computing, the trend is fading away. Those with university training in data science may favour it more than others, but in absolute terms it’s still only a small fraction of that group too that will go for R first.
C/C++ is prioritised more by those who want to enhance their existing apps/projects with machine learning (20%) and less by those who hope to build new highly competitive apps based on machine learning (14%). This pattern points again to C/C++ being mostly used in engineering projects and IoT or AR/VR apps, most likely already written in C/C++, to which ML-supported functionality is being added. When building a new app from scratch – especially one using NLP for chatbots – there’s no particular reason to use C/C++, while there are plenty of reasons to opt for languages that offer highly-specialised libraries, such as Python. These languages can more quickly and easily yield highly-performing algorithms that may offer a competitive advantage in new ML-centric apps.
There is no such thing as a ‘best language for machine learning’
If your first ever contact with programming is through machine learning, then your peers in our survey point to Python as the best option, given its wealth of libraries and ease of use. If, on the other hand, you’re dreaming of a job in an enterprise environment, be prepared to use Java. Whatever the case, these are exciting times for machine learning and the journey is guaranteed to be a mind-blowing one, irrespective of the language you opt for. Enjoy the ride!