Learn Python or Java for big data

Today we look at how to choose a programming language for big data.

Suppose you have a big data project. You know the problem domain, you know what infrastructure to use, and you may even have decided which framework will process all that data. But one decision keeps getting put off:

Which language should I choose? (Or perhaps the more pertinent question is: which language should I force all my developers and data scientists to use?) This question can't be postponed for long; sooner or later it has to be decided.

How to Choose a Programming Language for Big Data

Of course, nothing stops you from using other mechanisms (such as XSLT transformations) to work with big data. Generally speaking, though, several languages are in common use for big data today, including Java, Python, R, and Scala. So which one should you choose, and why or when? Below we look at two of them, Python and Java.

Python

If your data scientists don't use R, they're probably thoroughly familiar with Python, which has been popular among academics for more than a decade, especially in areas such as natural language processing (NLP). Consequently, if you have a project that requires NLP, you're faced with a dizzying number of choices, including the classic NLTK, topic modeling with Gensim, or the fast and accurate spaCy. Python is equally at home with neural networks, thanks to Theano and TensorFlow; and then there's scikit-learn for machine learning, and NumPy and pandas for data analysis.

There's also Jupyter/IPython, a web-based notebook server that lets you mix code, graphics, and almost any other object in a shareable logbook format. This has always been one of Python's killer features, but these days the concept is proving so useful that it's showing up in pretty much every language that embraces the read-evaluate-print loop (REPL) concept, including Scala and R.

Python tends to be supported in big data processing frameworks, but at the same time it tends not to be a "first-class citizen." New features in Spark, for example, almost always appear first in the Scala/Java bindings, and it may take several minor versions before those updates become available in PySpark (this is especially true on the Spark Streaming/MLlib side of development).

Java

In the end, it's always going to be Java: a language that's unloved, abandoned, owned by a company (Oracle) that only seems to care about it when there's money to be made by suing Google, and completely unfashionable. Only drones in the corporate world use Java! However, Java might be a good fit for your big data project. Think about Hadoop MapReduce, which is written in Java. What about HDFS? Also written in Java. Even Storm, Kafka, and Spark run on the JVM (in Clojure and Scala), which means that Java is a "first-class citizen" in these projects. There are also newer technologies like Google Cloud Dataflow (now Apache Beam), which until recently only supported Java.

Java may not be the rock-star language of choice. But while developers struggle to make sense of the tangle of callbacks in their Node.js apps, using Java gives you access to a vast ecosystem (including profilers, debuggers, monitoring tools, and libraries for enterprise security and interoperability), and much more besides, most of which has been tried and tested over the last two decades (sadly, Java turns 21 this year; we're all getting old).

One of the main criticisms of Java is that it's very verbose, and that it lacks the REPL needed for interactive development (which R, Python, and Scala all have). I've seen 10 lines of Scala-based Spark code quickly turn into a monstrous 200 lines when written in Java, along with huge type declarations that take up most of the screen. However, the new lambda support in Java 8 goes a long way toward improving this situation; Java will never be as compact as Scala, but Java 8 does make developing in Java less painful.
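To give a rough feel for the difference lambdas make, here is a minimal sketch of a word count written twice: once in the older loop style and once with Java 8 lambdas and streams. It uses plain Java collections rather than Spark, and the class name WordCount, the method names, and the sample words are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical example class; names are illustrative, not from any framework.
class WordCount {

    // Pre-Java 8 style: an explicit loop with manual map bookkeeping.
    static Map<String, Long> countWithLoop(List<String> words) {
        Map<String, Long> counts = new HashMap<String, Long>();
        for (String word : words) {
            String key = word.toLowerCase();
            Long current = counts.get(key);
            counts.put(key, current == null ? 1L : current + 1L);
        }
        return counts;
    }

    // Java 8 style: the same logic as a stream pipeline built from lambdas.
    static Map<String, Long> countWithStreams(List<String> words) {
        return words.stream()
                .map(w -> w.toLowerCase())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Sample input assumed for this sketch.
        List<String> words = Arrays.asList("Spark", "Hadoop", "spark", "Kafka", "SPARK");
        // Both versions produce the same counts.
        System.out.println(countWithLoop(words).equals(countWithStreams(words))); // prints true
        System.out.println(countWithStreams(words).get("spark"));                // prints 3
    }
}
```

The stream version is still wordier than the Scala equivalent would be, but the bookkeeping around the map is gone, which is the kind of improvement Java 8 brings.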

Which language should you use for big data projects? I'm afraid it depends on the situation. If you're doing NLP or intensive neural network processing across GPUs, Python is a good choice. If you want a hardened, production-oriented data streaming solution that has all the important operational tools, Java is an excellent choice.
