How to install Apache Spark on Windows 10
Apache Spark Installation on Windows
In this article, I will explain step by step how to install Apache Spark on Windows 7, 10, and later versions, and also how to start the history server and monitor your jobs using the Web UI.
Install Java 8 or Later
To install Apache Spark on Windows, you need Java 8 or a later version, so download Java from Oracle and install it on your system. If you prefer OpenJDK, you can download it from here.
Note: This article explains installing Apache Spark with Java 8; the same steps also work for Java 11 and Java 13.
Apache Spark Installation on Windows
Apache Spark comes as a compressed tar/zip file, so installation on Windows is straightforward: you just need to download and extract it. Download Apache Spark from the Spark Download page and select the link at “Download Spark” (point 3 in the screenshot below).
If you want to use a different version of Spark and Hadoop, select it from the drop-down; the link at point 3 then changes to the selected version and gives you an updated download link.
After downloading, extract the archive using 7-Zip or any other zip utility, and copy the extracted directory spark-3.0.0-bin-hadoop2.7 to c:\apps\opt\spark-3.0.0-bin-hadoop2.7.
Spark Environment Variables
Follow the steps below if you are not sure how to add or edit environment variables on Windows.
3. This opens the New User Variable window, where you can enter the variable name and value.
4. Now edit the PATH variable.
5. Add the Spark, Java, and Hadoop bin locations by selecting the New option (example values are sketched below).
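For reference, a typical set of values might look like the sketch below. These paths are only examples: they assume Spark was extracted to c:\apps\opt\spark-3.0.0-bin-hadoop2.7 and that winutils.exe will be copied into its bin folder as described later; adjust them to your own locations.

SPARK_HOME  = c:\apps\opt\spark-3.0.0-bin-hadoop2.7
HADOOP_HOME = c:\apps\opt\spark-3.0.0-bin-hadoop2.7
JAVA_HOME   = C:\Program Files\Java\jdk1.8.0_201   (wherever your JDK is installed)
PATH        = existing value plus %SPARK_HOME%\bin, %HADOOP_HOME%\bin, and %JAVA_HOME%\bin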
Spark with winutils.exe on Windows
Many beginners think Apache Spark needs a Hadoop cluster installed in order to run, but that’s not true; Spark can run without Hadoop and HDFS, for example on AWS using S3 or on Azure using Blob Storage.
To run Apache Spark on Windows, you need winutils.exe, because Spark uses POSIX-like file access operations on Windows through the Windows API.
winutils.exe enables Spark to use Windows-specific services, including running shell commands in a Windows environment.
Download winutils.exe for Hadoop 2.7 and copy it to the %SPARK_HOME%\bin folder. Winutils builds are different for each Hadoop version, so download the right version based on your Spark and Hadoop distribution from https://github.com/steveloughran/winutils
Apache Spark shell
spark-shell is a CLI utility that comes with the Apache Spark distribution. Open a command prompt, change to the %SPARK_HOME%\bin directory, and type spark-shell to start the Apache Spark shell. You should see something like the output below (ignore the error you see at the end).
spark-shell also creates a Spark context Web UI, which by default can be accessed at http://localhost:4040 (or the next free port, such as 4041, if 4040 is already in use).
On the spark-shell command line, you can run any Spark statements, such as creating an RDD or getting the Spark version.
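For example, a few statements you could paste into the shell (spark and sc are the SparkSession and SparkContext that spark-shell creates for you; the list is just an example):

spark.version                                   // prints the installed Spark version
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))    // create a small RDD from a local collection
rdd.count()                                     // returns 5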
This completes the installation of Apache Spark on Windows 7, 10, and later versions.
Where to go Next?
You can continue with the sections below to see how to debug jobs using the Spark Web UI and enable the Spark history server, or follow the links as next steps.
Web UI on Windows
Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, the resource consumption of the Spark cluster, and the Spark configuration. On the Spark Web UI, you can see how operations are executed.
Spark Web UI
History Server
To enable the history server, set the Spark event-log properties and then start it with the command shown below.
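A sketch of the properties, added to %SPARK_HOME%\conf\spark-defaults.conf (the log directory below is only an example path; point both entries to a folder that actually exists on your machine):

spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path
spark.eventLog.dir file:///c:/logs/path

Since start-history-server.sh is a shell script and does not run natively on Windows, the server is usually started with spark-class instead: %SPARK_HOME%\bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer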
By default, the history server listens on port 18080, and you can access it from a browser at http://localhost:18080/.
Spark History Server
By clicking on an App ID, you will get the details of that application in the Spark Web UI.
Conclusion
If you have any issues setting this up, please leave a message in the comments section and I will try to respond with a solution.
How to install Apache Spark on Windows?
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
In this document, we will cover the installation procedure of Apache Spark on the Windows 10 operating system.
Prerequisites
This guide assumes that you are using Windows 10 and that your user account has admin permissions.
System requirements:
Installation Procedure
Step 1: Go to the official Apache Spark download page below and choose the latest release. For the package type, choose ‘Pre-built for Apache Hadoop’.
The page will look like below.
Step 2: Once the download is complete, unzip the file using WinZip, WinRAR, or 7-Zip.
Step 3: Create a folder called Spark under your user directory, as shown below, and copy the contents of the unzipped file into it.
It looks like below after copy-pasting into the Spark directory.
Step 4: Go to the conf folder and open the log configuration file called log4j.properties.template. Change INFO to WARN (it can be ERROR to reduce the logging further). This and the next step are optional.
Remove the .template extension so that Spark can read the file.
Before removing the .template extension, all files look like below.
After removing the .template extension, the files will look like below.
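For reference, the edited line in log4j.properties for Spark 1.x/2.x typically looks like this after the change from INFO to WARN (the exact file layout may differ between Spark versions):

log4j.rootCategory=WARN, console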
Step 5: Now we need to configure the path.
Add the new user variable (or system variable) below. (To add a new user variable, click the New button under User variables.)
Add %SPARK_HOME%\bin to the path variable.
Step 6: Spark needs a piece of Hadoop to run on Windows. For Hadoop 2.7, you need to install winutils.exe.
You can find winutils.exe on the page below.
Step 7: Create a folder called winutils in the C drive and create a folder called bin inside it. Then move the downloaded winutils file to the bin folder.
Add a user (or system) variable HADOOP_HOME, just as you did for SPARK_HOME, pointing to the winutils folder (see the example below).
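As an illustration, the two variables could also be set from a command prompt with setx instead of the GUI; the paths are only examples that follow the folders created in steps 3 and 7, so adjust them to your own layout:

setx SPARK_HOME "C:\Users\<your-user>\Spark"
setx HADOOP_HOME "C:\winutils"

Note that variables set with setx only take effect in command prompts opened afterwards.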
Step 8: To install Apache Spark, Java must be installed on your computer. If you don’t have Java installed on your system, please follow the process below.
Java Installation Steps:
Accept the License Agreement for Java SE Development Kit 8u201.
Test Java Installation:
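One quick way to test it is from a command prompt:

java -version

This prints the installed Java version; for this guide it should report 1.8 (Java 8) or later.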
You should also check that JAVA_HOME is set and that %JAVA_HOME%\bin is included in the path under user variables (or system variables).
1. In the end, the environment variables should contain three new paths (if you needed to add the Java path; otherwise only SPARK_HOME and HADOOP_HOME).
2. Create the c:\tmp\hive directory. This step is not necessary for later versions of Spark, which create the folder on first start, but it is good practice to create it yourself.
Test Installation:
Open a command line and type spark-shell; you should get a result like the one below.
We have completed the Spark installation on a Windows system. Let’s create an RDD and a DataFrame.
We will create one RDD and one DataFrame, and then wrap up.
1. We can create an RDD in three ways; we will use one of them here.
Define any list and then parallelize it; this creates an RDD. The code is below; copy and paste it line by line into the command line.
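A minimal sketch of that step (the list and the names are just examples; sc is the SparkContext provided by spark-shell):

val ids = List(1, 2, 3, 4, 5)      // define any list
val idsRdd = sc.parallelize(ids)   // parallelize it to create an RDD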
The above will create an RDD.
2. Now we will create a DataFrame from the RDD. Follow the steps below to create the DataFrame.
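A sketch of the conversion, reusing the idsRdd from the previous step (spark is the SparkSession created by spark-shell):

import spark.implicits._        // needed for the toDF conversion
val df = idsRdd.toDF("id")      // creates a DataFrame with a single column named id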
The above code will create a DataFrame with id as a column.
To display the data in the DataFrame, use the command below.
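For example, using the DataFrame created above:

df.show()   // prints the DataFrame contents as a small table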
It will display the below output.
How to uninstall Spark from Windows 10 System:
Please follow the steps below to uninstall Spark on Windows 10.
To remove the system/user variables, please follow the steps below:
Open Command Prompt, type spark-shell, and press Enter; you will now get an error, which confirms that Spark has been successfully uninstalled from the system.
Ravichandra Reddy Maramreddy
Ravichandra is a developer specialized in the Spark and Hadoop ecosystems, HDFS, and MapReduce, covering estimation, requirement analysis, design, development, coordination, and validation, with an in-depth understanding of design practices. He has extensive experience with Spark, Spark Streaming, PySpark, Scala, Shell, Oozie, Hive, HBase, Hue, Java, Spark SQL, Kafka, and WSO2, as well as with data structures and algorithms.
8 comments
Complete guidance about how to install Apache Spark. Thanks a lot for that. Very helpful.
thanks a lot..good and complete setup 🙂
Thank you. This was very helpful.
thank you very much, you are awesome
Thanks, man! Awesome guide. Helped a lot
This is really very helpful how to install guide. Thanks Reddy! Great job!
Thank you. This document helped me alot. Cheers
This tutorial is very very useful. Many thanks Regards
Apache Spark installation on Windows 10
This post is meant to help people install and run Apache Spark on a computer with Windows 10 (it may also help with prior versions of Windows, or even Linux and Mac OS systems) who want to try out and learn how to interact with the engine without spending too many resources. If you really want to build a serious prototype, I strongly recommend installing one of the virtual machines I mentioned in a post a couple of years ago, Hadoop self-learning with pre-configured Virtual Machines, or spending some money on a Hadoop distribution in the cloud. The new versions of these VMs come with Spark ready to use.
A few words about Apache Spark
Apache Spark is making a lot of noise in the IT world as a general engine for large-scale data processing, able to run programs up to 100x faster than Hadoop MapReduce, thanks to its in-memory computing capabilities. It is possible to write Spark applications using Java, Python, Scala and R, and it comes with built-in libraries to work with structured data (Spark SQL), graph computation (GraphX), machine learning (MLlib) and streaming (Spark Streaming).
Spark runs on Hadoop, on Mesos, in the cloud, or standalone. The last case is the one covered in this post. We are going to install Spark 1.6.0 in standalone mode on a computer with a 32-bit Windows 10 installation (my very old laptop). Let’s get started.
Install or update Java
For any application that uses the Java Virtual Machine, it is always recommended to install the appropriate Java version. In this case I just updated my Java version as follows:
Start –> All apps –> Java –> Check For Updates
In the same way you can verify your Java version. This is the version I used:
Download Scala
Download from here. Then execute the installer.
I just downloaded the binaries for my system:
Download Spark
Select any of the prebuilt versions from here.
As we are not going to use Hadoop, it makes no difference which version you choose. I downloaded the following one:
Feel free also to download the source code and make your own build if you feel comfortable with it.
Extract the files to any location in your drive with enough permissions for your user.
Download winutils.exe
This was the critical point for me, because I downloaded one version and it did not work until I realized that there are 64-bit and 32-bit versions of this file. Here you can find them accordingly:
To make my trip even longer, I had to install Git to be able to download the 32-bit winutils.exe. If you know another link where this file can be found, you can share it with us.
Git client download (I hope you don’t get stuck in this step)
Extract the folder containing the file winutils.exe to any location of your preference.
Environment Variables Configuration
This is also crucial in order to run some commands without problems using the command prompt.
Environment Variables 1/2
Environment Variables 2/2
Permissions for the folder tmp/hive
I struggled a little bit with this issue. After I set everything up, I tried to run spark-shell from the command line and was getting an error that was hard to debug. The shell tries to find the folder tmp/hive and was not able to set up the SQL context.
I looked at my C drive and found that the C:\tmp\hive folder had been created. If it has not, you can create it yourself and set 777 permissions on it. In theory you can do this with the advanced sharing options of the Sharing tab in the folder properties, but I did it from the command line using winutils:
Open a command prompt as administrator and type:
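The command looks like the following (C:\Hadoop\bin is just an example of where winutils.exe was extracted; adjust the path to your own location):

C:\Hadoop\bin\winutils.exe chmod 777 \tmp\hive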
Set 777 permissions for tmp/hive
Please be aware that you need to adjust the path of the winutils.exe above if you saved it to another location.
We are finally done and can start spark-shell, which is an interactive way to analyze data using Scala or Python. This will also serve to test our Spark installation.
Using the Scala Shell to run our first example
In the same command prompt, go to the Spark folder and type the following command to run the Scala shell:
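For example (the folder name is only an assumption about where you extracted the Spark 1.6.0 package; use your own location):

cd C:\spark\spark-1.6.0-bin-hadoop2.6
bin\spark-shell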
Start the Spark Scala Shell
After some execution lines you should be able to see a similar screen:
You are going to receive several warnings and informational messages in the shell because we have not set various configuration options. For now, just ignore them.
Let’s run our first program with the shell; I took the example from the Spark Programming Guide. The first command creates a resilient distributed dataset (RDD) from a text file included in Spark’s root folder. After the RDD is created, the second command just counts the number of items inside:
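The two commands, roughly as they appear in the Spark Programming Guide, are:

val textFile = sc.textFile("README.md")   // create an RDD from the README file in the Spark root folder
textFile.count()                          // count the number of items (lines) in the RDD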
Running a Spark Example
And that’s it. Hope you can follow my explanation and be able to run this simple example. I wish you a lot of fun with Apache Spark.
References
Share this:
Like this:
About Paul Hernandez
55 Responses to Apache Spark installation on Windows 10
Hi Paul
This is a great help to me, but it seems I’m doing something wrong.
I have Windows 10 Pro 64-bit. I downloaded the winutils.exe (64-bit), but when I tried to execute:
C:\WINDOWS\system32>c:\Hadoop\bin\winutils.exe chmod 777 \tmp\hive
I obtain an error (the winutils.exe is not compatible with the Windows version):
“This version of c:\Hadoop\bin\winutils.exe is not compatible with the version of Windows you are running. Check your computer’s system information and then contact the software publisher.”
Must I download the whole folder where the winutils.exe is?
Do you have any idea what I’m doing wrong?
Hi Paul,
The winutils issue was my headache. Please try to do the following:
– Copy the content of the whole library and try again.
– If this doesn’t help, try to build the hadoop sources by yourself, I wrote a post about it (https://wordpress.com/stats/day/hernandezpaul.wordpress.com). It was also a pain in the a…
– If you don’t want to walk this way, just let me know and I will share a link to download the winutils I built. I did it with Windows Server 64-bit but it should also work for Windows 10.
– Last thing I can offer to you is download the hadoop binaries that this blogger offers in this post: http://kplitzkahran.blogspot.de/2015/08/hadoop-271-for-windows-10-binary-build.html
the download link is at the very end of the post.
Kind Regards,
Paul
I downloaded the whole library and it seems fine!
Only some warnings appear, but the program is now running.
Thank you very much!!
How do you set environment when you download the whole library?
How do you configure your environment variables? I am facing some problems here.
I was hit by the same error hidden in many lines of warnings, exceptions, etc. Your post saved my day. Thank You.
Thanks a ton for this amazing post. However I am facing a problem which I cannot resolve. It would be great if you could help me out with it.
In the example I tried using val textFile = sc.textFile(“”C:/Users/…./spark-2.1.0-bin-hadoop2.7/spark-2.1.0-bin-hadoop2.7/spark-2.1.0-bin-hadoop2.7/README.md”)
I am getting an error “error: not found: value sc”
I also tried without the complete path, i.e. just (“README.md”), and the same error appears.
Hi Joyishu,
please open a command shell and navigate to the spark directory (i.e. cd spark1_6). This is the directory where the README.md is located and also the bin folder.
Start the Scala Shell without leaving this directory. You can do that by typing bin\spark-shell
Once the shell has started you are still in this working directory and just need to type val textFile = sc.textFile(“README.md”)
You don’t need to specify the complete location because the file is located in your working directory
Please note in your example above that there is a typing error in your line with the complete path, there are 2 opening quotation marks (“”C:/Users).
If you are using windows you should also consider backslashes “\” for the path definitions.
Last but not least, you can find more information about the textFile function here: Spark Programming Guide
I did exactly the same but the error still persists. If you could please share your email, I can mail you the screenshot. Thanks again for all the help.
Hi Joyishu, add me to linkedin, then we can communicate with eachother: https://de.linkedin.com/in/paulhernandezplayingwithbi
Maybe this will help:
The winutils should explicitly be inside a bin folder inside the Hadoop Home folder. In my case HADOOP_HOME points to C:\tools\WinUtils and all the binaries are inside C:\tools\WinUtils\bin
Great tutorial.
Three additions:
You don’t need Git installed to download a repository (you might not even need an account on github, but I’m not sure about the latter)
In any case, you can download the repository as a ZIP from the root of the repository (https://github.com/steveloughran/winutils): select “Clone or download” > “Download ZIP”.
Secondly, Spark should be installed in a folder path containing *no spaces*, so don’t install it in “Program Files”.
Thirdly, the winutils should explicitly be inside a bin folder inside the Hadoop Home folder. In my case HADOOP_HOME points to C:\tools\WinUtils and all the binaries are inside C:\tools\WinUtils\bin
(Maybe this is also the problem @joyishu is suffering from, because I got the exact same error before fixing this)
Many thanks for the contribution ☺
Secondly, Spark should be installed in a folder path containing *no spaces*, so don’t install it in “Program Files”.
Hi Vishal,
White spaces cause errors when the application tries to build paths from system or internal variables. I cannot tell you what the impact is for Spark, but the typical example is the content of the JAVA_HOME environment variable. At least for this case you can use the following notation to overcome the problem:
Progra~1 = ‘Program Files’
Progra~2 = ‘Program Files (x86)’
Please have a look here: https://confluence.atlassian.com/doc/setting-the-java_home-variable-in-windows-8895.html
For other cases I cannot tell you. Try to discuss it with your system administrator or you may use another drive, i.e. “D:\”.
Best regards, Paul
Hi Paul,
Thanks, it resolved the problem now, but I am still struggling to get past the following error. See if you can help me out!!
Thanks in advance 🙂
” invalid loc header (bad signature) scala”
This is the error for me.
Hey Great tutorial.. Thanks for your help
Thanks! The post was helpful to me in troubleshooting my setup, especially running the winutils part to setup hive directory permissions.
Nice. It resolved all the errors that most first-time developers encounter.
The blog has helped me a lot with the whole installation when errors occurred, but I am still facing a problem while installing Spark on Windows, when launching spark-shell.
Can anybody please help with a solution as soon as possible? Thanks in advance.
VARIABLES:
JAVA_HOME:C:\Program Files\Java\jdk1.8.0_131
SBT_HOME:C:\Program Files (x86)\sbt\
SCALA_HOME:C:\Program Files (x86)\scala\bin
SPARK_HOME:C:\spark-2.2.0\bin
HADOOP_HOME:C:\hadoop-master
Path:C:\Program Files (x86)\scala\bin;C:\Program Files (x86)\sbt\bin;C:\spark-2.2.0\bin;C:\hadoop-master\bin;
17/08/23 21:34:47 WARN General: Plugin (Bundle) “org.datanucleus.api.jdo” is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL “file:/C:/spark-2.2/jars/datanucleus-api-jdo-3.2.6.jar” is already registered, and you are trying to register an identical plugin located at URL “file:/C:/spark-2.2/bin/../jars/datanucleus-api-jdo-3.2.6.jar.”
java.lang.IllegalArgumentException: Error while instantiating ‘org.apache.spark.sql.hive.HiveSessionStateBuilder’:
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1053)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:129)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:126)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:938)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:938)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:938)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)
… 47 elided
:14: error: not found: value spark
import spark.implicits._
^
:14: error: not found: value spark
import spark.sql
As far as I know, the build is not successful, which is why your Spark session is not instantiated. You have added the same jar (datanucleus-api-jdo-3.2.6.jar) multiple times, and while building the session the system is confused about which jar file to take. You may check the locations mentioned in the error for duplicate jar files and delete one of them.
So it leaves me still stuck at spark-shell initialization for spark context instance creation