New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning Software Poll
Python caught up with R and (barely) overtook it; Deep Learning usage surges to 32%; RapidMiner remains top general Data Science platform; Five languages of Data Science.
The 18th annual KDnuggets Software Poll again drew strong participation from the analytics and data science community and vendors, attracting about 2,900 voters, almost exactly the same as last year. Here is the initial analysis, with more detailed results to be posted later.
Python, whose usage has been growing faster than R's for the last several years, has finally caught up with R and (barely) overtaken it, with 52.6% of respondents using it vs 52.1% for R.
The biggest surprise is probably the phenomenal share of Deep Learning tools, now used by 32% of all respondents, while only 18% used DL in 2016 and 9% in 2015. Google TensorFlow rapidly became the leading Deep Learning platform with 20.2% usage, up from only 6.8% in the 2016 poll, and entered the top 10 tools.
While in 2014 I wrote about the four main languages for Analytics, Data Mining, and Data Science being R, Python, SQL, and SAS, the five main languages of Data Science in 2017 appear to be Python, R, SQL, Spark, and TensorFlow.
RapidMiner remains the most popular general platform for data mining/data science, with about 33% usage, almost exactly the same as in 2016.
We note that many vendors have encouraged their users to vote, but all vendors had equal chances, so this does not violate KDnuggets guidelines. We have not seen any bot voting or direct links to vote for only one tool this year.
Spark grew to about 23% and kept its place in the top 10, ahead of Hadoop.
Besides TensorFlow, another new tool in the top tier is Anaconda, with 22% usage.
Table 1: Top Analytics/Data Science Tools in 2017 KDnuggets Poll, 2017 vs 2016
In this table, 2017 % usage is the percentage of voters who used the tool; % change is the change in usage vs the 2016 Software Poll, with green and red highlighting changes up and down of 5% or more; and % alone is the percentage of voters who used only the reported tool, among all voters who used that tool. For example, 3.3% of R voters reported using only R and nothing else. This year there were 13 tools with 5% or more lone votes.
Average number of tools per respondent was 6.1, almost unchanged from 6.0 in 2016.
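The three per-respondent measures described above (% usage, % alone, and average tools per respondent) can be sketched in a few lines of Python. This is only an illustration of the definitions: the `ballots` data and the `poll_stats` helper are hypothetical and are not the poll's actual data or processing code.

```python
from collections import Counter

# Hypothetical ballots: each set holds the tools one respondent selected.
ballots = [
    {"R", "Python", "SQL"},
    {"R"},
    {"Python", "TensorFlow"},
    {"Python", "SQL", "Spark", "R"},
]

def poll_stats(ballots):
    """Return per-tool usage/alone percentages and the average tools per ballot."""
    n = len(ballots)
    # Number of respondents who selected each tool.
    used = Counter(tool for b in ballots for tool in b)
    # Number of respondents who selected that tool and nothing else.
    alone = Counter(next(iter(b)) for b in ballots if len(b) == 1)
    stats = {
        tool: {
            "usage_pct": 100 * count / n,            # share of all voters
            "alone_pct": 100 * alone[tool] / count,  # share of that tool's voters
        }
        for tool, count in used.items()
    }
    avg_tools = sum(len(b) for b in ballots) / n
    return stats, avg_tools

stats, avg_tools = poll_stats(ballots)
print(stats["R"], avg_tools)
```

Note that % alone is divided by the tool's own voters, not by all voters, which is why a widely used tool can still have a small lone share.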
Compared to the 2016 KDnuggets Analytics/Data Science Poll results, the 2 newcomers in the top 11 are Anaconda and TensorFlow.
The participation by region was:
- US/Canada (41.5%),
- Europe (35.5%),
- Asia (10.1%),
- Latin America (6.5%),
- Africa/MidEast (3.8%),
- Australia/NZ (2.7%).
Compared to 2016, we note slightly less participation from Europe, and slightly more from all other regions.
Notable new tools in the poll with over 2% usage are Keras (9.5%), PyCharm (9%), Microsoft R Server (4.3%), IBM DSX (3.0%), PyTorch (3.0%), and Teradata (2.4%).
The table below lists the tools that have grown 20% or more in usage and reached at least 2% usage in 2017. Note this includes 5 Deep Learning tools and 4 Microsoft tools.
Table 2: Major Analytics/Data Science Tools with the largest increase in usage
| Tool | % change | 2017 % usage | 2016 % usage |
|---|---|---|---|
| Microsoft Power BI | 4% | 10.2% | 5.6% |
| SQL on Hadoop tools | 42% | 10.3% | 7.3% |
| Microsoft other ML/Data Science tools | 40% | 2.2% | 1.6% |
| Other Deep Learning Tools | 30% | 4.8% | 3.7% |
| Microsoft Azure Machine Learning | 26% | 6.4% | 5.1% |
Table 3: Major Analytics/Data Science Tools with the largest decline in usage
| Tool | % change | 2017 % usage | 2016 % usage |
|---|---|---|---|
| Turi (former Dato/GraphLab) | -93% | 0.2% | 2.4% |
| Hadoop: Open Source Tools | -32% | 15.0% | 22.1% |
| Other free analytics/data mining tools | -29% | 4.8% | 6.8% |
Top Analytics/Data Science Tools
Deep Learning Tools
The usage of Deep Learning tools jumped to 32% of all respondents, vs only 18% in 2016 and 9% in 2015.
Google TensorFlow is the dominant platform, displacing last year's leader, Theano/Pylearn2.
Top tools are:
- TensorFlow, 20.2% usage
- Keras, 9.5%
- Theano, 5.8%
- Other Deep Learning Tools, 4.8%
- Microsoft CNTK, 3.4%
- Caffe, 3.1%
- PyTorch, 3.0%
- DL4J, 2.2%
- mxnet, 1.8%
- Torch, 1.2%
- Lasagne, 0.9%
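The per-tool shares in the list above add up to well over 32%, because a respondent could select several Deep Learning tools. The overall figure is presumably the share of respondents who used at least one tool in the group. A minimal sketch of that distinction follows; the `ballots` data, the `DEEP_LEARNING` set, and both helper functions are hypothetical illustrations, not the poll's data or method.

```python
# Hypothetical group definition and ballots (one set of tools per respondent).
DEEP_LEARNING = {"TensorFlow", "Keras", "Theano"}

ballots = [
    {"TensorFlow", "Keras", "Python"},
    {"R", "SQL"},
    {"Theano"},
    {"Python", "Spark"},
    {"Keras", "TensorFlow"},
]

def tool_usage_pct(ballots, tool):
    """Percentage of respondents who selected one specific tool."""
    return 100 * sum(1 for b in ballots if tool in b) / len(ballots)

def group_usage_pct(ballots, group):
    """Percentage of respondents who selected at least one tool in the group."""
    return 100 * sum(1 for b in ballots if b & group) / len(ballots)

group_pct = group_usage_pct(ballots, DEEP_LEARNING)
summed_pct = sum(tool_usage_pct(ballots, t) for t in DEEP_LEARNING)
print(group_pct, summed_pct)
```

In this toy data the group share is lower than the sum of the individual shares, since respondents who picked both TensorFlow and Keras are counted only once in the group figure.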
Hadoop/Big Data Tools
We have simplified the choices on Hadoop/Spark tools to Hadoop: Commercial Tools, Hadoop: Open Source Tools, SQL on Hadoop, and Spark; together they were used by 33% of all respondents. This is slightly lower than the 39% in 2016, but more tools were counted as Big Data in 2016. In 2015, 29% used Spark/Hadoop tools.
In 2017, Big Data tool usage was:
- Spark, 22.7%
- Hadoop: Open Source Tools, 15.0%
- SQL on Hadoop tools, 10.3%
- Hadoop: Commercial Tools, 7.6%
Python, R, and Scala grew in popularity, while Java, Unix tools, C/C++, Perl, F#, Clojure, and Lisp declined; Julia usage was unchanged.
Here are the main programming languages sorted by popularity.
- Python, 52.6% usage (was 45.8% in 2016), 15% up
- R language, 52.1% (was 49.0%), 6% up
- SQL, 34.9% (was 35.5%), 2% down
- Java, 13.8% (was 16.8%), 18% down
- Unix shell/awk/gawk, 9.6% (was 10.4%), 7% down
- C/C++, 6.3% (was 7.3%), 13% down
- Perl, 1.7% (was 2.3%), 27% down
- Julia, 1.1% (was 1.1%), no change
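The "up" and "down" figures in the list above are relative changes in usage share, not differences in percentage points: Python gained 6.8 points, which is about 15% of its 2016 share. A small sketch of that arithmetic, using shares from the list; the `relative_change` helper is an illustrative name, and results computed from these rounded shares can differ by a point from figures derived from unrounded poll data.

```python
def relative_change(now, before):
    """Relative change in percent between two usage shares."""
    return 100 * (now - before) / before

# (2017 share, 2016 share) in percent, taken from the list above.
usage = {
    "Python": (52.6, 45.8),
    "R": (52.1, 49.0),
    "SQL": (34.9, 35.5),
    "Java": (13.8, 16.8),
    "Julia": (1.1, 1.1),
}

changes = {
    lang: round(relative_change(now, before))
    for lang, (now, before) in usage.items()
}
print(changes)  # positive values are growth, negative values are decline
```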
Python keeps growing and sucking oxygen from competitors like Julia, which surprisingly did not grow its usage.