Julia: The new contender in Data Science

  • Rishab Krishanmurthy
  • Oct 01, 2020
  • Machine Learning

Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning, and big data.

 

What use do programming languages have in Data Science? Data Science usually involves manipulating and processing large amounts of data, which is very difficult to do manually. Programming languages provide tools, modules, and algorithms that make this practical. They also allow the processed data to be visualized, which helps in analyzing it and drawing inferences.

 

The two major languages in use right now are Python and R. According to the TIOBE index, Python is one of the most widely used general-purpose programming languages, behind only C and Java as of this writing.

 

Enter Julia. Julia is a general-purpose programming language built for numerical analysis and computational science. It offers a simple syntax similar to Python's but can run at speeds comparable to C.

 

In this article, we will explore whether Julia will unseat the two languages, or whether it will occupy only a specific niche in the data scientist's toolkit.

 

Status Quo: Why are the languages Python and R so widely used?

 

Part One: The R Programming Language

 

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment. The S language and environment were developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. It is accurate to say that R is a different implementation of S. There are some differences, but much code written for S runs unaltered under R.

 

Why is R good for Data Science?

 

R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. It includes

  • an effective data handling and storage facility,

  • a suite of operators for calculations on arrays, in particular matrices,

  • a coherent, integrated collection of intermediate tools for data analysis,

  • graphical facilities for data analysis and display either on-screen or on hardcopy, and

  • a well-developed and simple programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities.

 

R provides a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more) as well as graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

 

Academics often use R for statistical analysis, so there is a knowledgeable community that supports learners and helps the language grow.

 

R has many packages that facilitate data wrangling (cleaning complex data sets to enable convenient consumption and further analysis). Some examples of these packages are:

 

  1. dplyr

  2. data.table

  3. readr

It also provides packages for data visualization.

 

The combination of the above factors makes R a suitable language for Data Scientists and Analysts.

 

Drawbacks:

 

  1. Memory requirements:

    1. R uses more memory than Python.

    2. R requires the entire data set to be held in one place, that is, in memory.

    3. Handling big data sets requires additional data management packages and integration with Hadoop.

  2. Lack of security:

    1. R lacks basic security mechanisms, which makes it difficult to embed in web applications

  3. Steep learning curve

  4. Lower speed compared to Python and even MATLAB

 

Part Two: The Python Programming Language

 

Python is an interpreted, high-level, and general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write coherent and logical code for small and large-scale projects.

 

Why is Python good for data science?

 

  1. Easy syntax:

    1. Python has long been known as a simple programming language to pick up, from a syntax point of view.

    2. Compared to languages like C, C++, or Java, the Python programming language has a simpler, more readable syntax

  2. A large number of libraries:

    1. Python has a vast selection of libraries that perform various functions

    2. These cover a large number of data science operations

    3. The packages get updated constantly

    4. New packages get introduced into the package ecosystem constantly

  3. Active community

    1. Python also has a large group of people who maintain the language, in addition to a vast selection of libraries and resources. 

    2. This community also ensures that new packages get introduced into the package ecosystem.

    3. The community's constant maintenance results in a programming platform that makes sense to use with emerging technologies like machine learning and data science

  4. Portability:

    1. Projects written in Python can interoperate with code written in other languages such as C and Java

    2. Tools such as Cython allow existing Python code to be augmented with C-level types and compiled for speed

  5. General-purpose programming language:

    1. Being a general-purpose programming language also gives Python an edge over the rest of the languages

    2. Python has useful libraries that fit any task that comes up during the manipulation and processing of data

    3. Python is a Swiss Army knife, able to occupy any niche or function it needs to fulfill
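
To make the "libraries for any task" point concrete, here is a minimal sketch that parses and summarizes a small data set using only Python's standard library (csv, io, and statistics). The sample data, the column names (city, temp_c), and the summarize helper are invented for illustration.

```python
import csv
import io
import statistics

# Hypothetical sample data; in practice this would usually come from a file.
RAW = """city,temp_c
Oslo,4.0
Cairo,29.5
Lima,19.0
"""


def summarize(raw_csv):
    """Return the mean and median of the temp_c column."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    temps = [float(row["temp_c"]) for row in rows]
    return {
        "mean": statistics.mean(temps),
        "median": statistics.median(temps),
    }


summary = summarize(RAW)
print(summary)
```

For larger data sets, third-party libraries would normally replace the hand-written parsing, but the overall shape of the task stays the same.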

 

Drawbacks:

 

  1. Speed limitations:

    1. Python is an interpreted language, which means that Python code is executed step by step by an interpreter at run time, rather than compiled to machine code ahead of time

    2. Interpreted execution slows down Python code

    3. Low execution speed is especially a problem when dealing with large data sets

  2. Error-handling:

    1. Python reports most errors (for example, a misspelled variable name) only when the faulty line is actually executed, rather than reporting them all at once before the program runs

    2. Surfacing errors one at a time can significantly increase debugging time and may eat into the time available for testing
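
The error-handling drawback is easy to reproduce. In the sketch below, the describe function (a made-up example) contains a deliberate typo in one branch. Python defines the function and runs the healthy branch without complaint, and raises a NameError only when the faulty branch is actually reached.

```python
def describe(value):
    """Classify a number; the negative branch contains a deliberate typo."""
    if value >= 0:
        return "non-negative"
    # 'resutl' is never defined, but Python only notices when this line runs.
    return resutl


# The function is defined and the first call succeeds without complaint.
first = describe(5)

# The bug surfaces only once the faulty branch is actually executed.
try:
    describe(-1)
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

print(first, error_name)
```

Linters and type checkers can flag some of these mistakes before the code runs, which softens this drawback in practice.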



Julia, Python, and R: The three languages being analysed


The Contender: Julia

 

Julia is a flexible, dynamic language, appropriate for scientific and numerical computing, with performance comparable to traditional statically-typed languages.  

 

Why is Julia Good for Data Science?

 

  1. Julia is compiled, not interpreted:

    • For faster runtime performance, Julia is just-in-time (JIT) compiled using the LLVM compiler framework. At its best, Julia can approach or match the rate of execution of C.

  2. Julia is interactive

    • Julia includes a REPL (read-eval-print loop), or interactive command line, similar to what Python offers. 

    • It also facilitates the creation of quick one-off scripts and commands.

  3. Julia has a straightforward syntax: 

    • The Julia programming syntax is similar to Python—terse but also expressive and powerful.

  4. Julia combines the benefits of dynamic typing and static typing:

    • You can specify types for variables, like an unsigned 32-bit integer. 

    • But you can also create hierarchies of types to allow general cases for handling variables of specific types, for instance, to write a function that accepts integers without specifying the size or signedness of the integer.

    • If types are not needed in a particular context, you can even do without typing entirely.

  5. Julia can call Python, C, and Fortran libraries. 

    • Julia can interface directly with external libraries written in C and Fortran

    • It is also possible to interface with Python code by way of the PyCall library and even share data between Python and Julia.

  6. Julia supports metaprogramming 

    • Julia programs can generate other Julia programs, and even modify their own code, in a way that is reminiscent of languages like Lisp.

  7. Julia has a full-featured debugger: 

    • Julia 1.1 introduced a debugging suite, which executes code in a local REPL and allows you to step through the results, inspect variables, and add breakpoints in code. 

    • You can even perform fine-grained tasks like stepping through a function generated by code.

  8. Other benefits are:

    • The core language imposes very little

    • Julia Base and the standard library are written in Julia itself, including primitive operations like integer arithmetic

    • A rich language of types for constructing and describing objects, which can also optionally be used to make type declarations

    • The ability to define function behavior across many combinations of argument types via multiple dispatch

    • Automatic generation of efficient, specialized code for different argument types
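
Multiple dispatch means that the implementation to run is chosen from the runtime types of all arguments, not just the first one. The sketch below imitates the idea in Python, purely to keep this article's examples in one language. It is not how Julia implements dispatch: the dispatch decorator, the _REGISTRY table, and the combine function are hypothetical names, and the lookup matches exact types only, whereas Julia also takes the type hierarchy into account.

```python
# Hypothetical registry: (function name, argument types) -> implementation.
_REGISTRY = {}


def dispatch(*types):
    """Register an implementation for one exact combination of argument types."""
    def decorator(func):
        _REGISTRY[(func.__name__, types)] = func

        def caller(*args):
            # Pick the implementation from the runtime types of ALL arguments.
            key = (func.__name__, tuple(type(a) for a in args))
            if key not in _REGISTRY:
                raise TypeError(f"no method for {key}")
            return _REGISTRY[key](*args)

        return caller
    return decorator


@dispatch(int, int)
def combine(a, b):
    return a + b


@dispatch(str, int)
def combine(a, b):
    return a * b


@dispatch(int, str)
def combine(a, b):
    return f"{a}{b}"


print(combine(2, 3), combine("ab", 2), combine(2, "ab"))
```

In Julia the same effect comes from writing several methods of one function with different argument type annotations, with no registry code needed.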

 

Drawbacks:

 

  1. Immature:

    1. The Julia language has nowhere near as many packages as Python, nor as large a community

    2. As a result of this immaturity, many of Julia's packages are still in the developmental stage, with updates few and far between

  2. 1-based array indexing and portability:

    1. Arrays start at index one rather than zero, so code is not as easily ported to or from zero-indexed languages such as Python and C.
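
The indexing difference is a common source of off-by-one bugs when porting code. Here is a small Python sketch of the problem, assuming a hypothetical helper named from_one_based that translates a Julia-style index into a Python-style one.

```python
def from_one_based(index):
    """Convert a 1-based (Julia-style) index to a 0-based (Python-style) one."""
    if index < 1:
        raise IndexError("1-based indices start at 1")
    return index - 1


samples = [10, 20, 30]

# In Julia, samples[1] is the first element; in Python it is the second.
first_julia_style = samples[from_one_based(1)]
naive_port = samples[1]

print(first_julia_style, naive_port)
```

Copying an index expression across languages without this adjustment silently selects the wrong element instead of raising an error.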

 

 

Conclusion

 

Julia is an interesting and fresh language. While it may not be able to unseat the current programming stalwarts like Python and R, it has a dedicated community and new users every day that help it grow.

 

On our blog, you can check out more articles about programming in R and Python.
