This is a Frequently Asked Questions webpage for the Hadoop part of the Programming for Big Data Module.
This page contains questions asked by previous students. The answers given aim to answer these questions and to give some additional pointers.
The questions are grouped into sections based on the topics covered each week, and also include any questions received about the Assignment.
If no Questions are listed for a week or on a topic, then no one asked any questions.
Submitting new questions: All new questions should be emailed to me. Try to make your question specific to a topic or a particular challenge, and give examples to help illustrate it. If you come across any materials or resources you would like to share with the class, I can post the details on this page.
Q: Do I need to install anything to run the VM?
You will need to have VirtualBox installed on your laptop/desktop to be able to run the VM. This is a simple install and only takes a few minutes. If this can be done in advance of the first class then it will save you some time.
Q: Do I have to use the supplied VM?
No, you can use any VM that has Hadoop installed on it. An alternative is to use one of the many cloud-based Hadoop environments. For example, Amazon has a few different versions. Several students have used these in previous years.
Q: What version of the VM should I use?
It is recommended that you use the 64bit version of the VM. Most laptops/desktops nowadays are 64bit, so use the 64bit VM.
You do need to ensure that virtualization is enabled on your laptop/desktop. This may require some changes to the BIOS settings of your machine.
Q: Do I have to set up shared folders for the VM? Is there an alternative way to share files between my laptop and VM?
No, you can do all the work on the VM. Then you don’t have to move any of the Java files from your laptop/desktop to the VM.
But if you prefer to work with Java on your laptop/desktop, then an alternative is to use Google Drive, email, etc. to share the files between your working environments.
Whatever works best and easiest for you.
Q: The VM went into hibernate mode. When it wakes up it asks for a username and password. What should I enter?
Yes, this can happen. The username and password are in the lab notes for Week 1.
To save you the task of trying to find them, they are listed below. Make sure to use the correct username/password for the version of the VM you are using (32bit vs 64bit).
64bit VM = soctech / ubuntu
32bit VM = Hadoop / hadoopVM
Q: Can I use the latest version of Eclipse, instead of Luna?
Yes, you can use the latest version of Eclipse if you want, but it is configured to use the latest version of Java. The version of Java on the VM depends on the version of Hadoop, and it will not be the latest version of Java.
You will need to configure the latest version of Eclipse to work with the version of Java on the VM.
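One quick way to confirm which Java version a given Eclipse run configuration actually uses is to print it from a small program. This is only a sketch: the class name JavaVersionCheck is made up for illustration, and it relies on nothing beyond the standard java.version system property.

```java
// Hypothetical helper class: reports the Java version of the JVM that runs it.
final class JavaVersionCheck {

    // Returns the value of the standard "java.version" system property.
    static String runtimeVersion() {
        return System.getProperty("java.version");
    }

    public static void main(String[] args) {
        // Run this from Eclipse and compare the result with `java -version` on the VM.
        System.out.println("Running on Java " + runtimeVersion());
    }
}
```

If the version printed from Eclipse differs from the one reported in a terminal on the VM, the Eclipse JRE setting is the likely place to look.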
Q: Do I need to be good at programming to take this module?
Yes, there is a lot of coding in all components of this module, not just in the Hadoop section.
For the Hadoop section it is ideal that you have some experience of working with Java, but it is not mandatory. It comes down to how confident you are at coding in other languages and other IDEs. You may be able to pick up Java and the other languages quickly.
Q: I get the following error when I try to run my first MR jar file.
When I enter the line below:
soc@soc-VirtualBox:~$ hadoop jar WordCount.jar WordCount shakespeare/poems myOutput
I just get this line:
Usage: WordCount <input path> <output path>
And nothing actually runs.
There are a couple of things to check. Check that the jar is generated as ‘Runnable Jar File’ then try to run it from the directory where it is located. If that doesn’t work then generate it as a ‘Jar File’ and again run it from the directory where it is located.
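Alternatively, if the jar already names its main class in its manifest, everything after the jar name is passed to the program as arguments, so the extra WordCount counts as a third argument. In that case, try changing the command to the following: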
hadoop jar WordCount.jar shakespeare/poems myOutput
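The usage line suggests that the driver checks how many arguments it received before it configures the job. The plain-Java sketch below mirrors that kind of check without using any Hadoop classes. It assumes the driver expects exactly two arguments (input path and output path); the class and method names are made up for illustration and are not taken from the lab code.

```java
// Plain-Java sketch (no Hadoop classes): mirrors an argument-count check in a driver.
final class UsageCheckSketch {

    static final String USAGE = "Usage: WordCount <input path> <output path>";

    // Returns the usage message when the argument count is wrong, or null when it is fine.
    // Assumption: the real driver expects exactly two arguments.
    static String checkArgs(String[] args) {
        return args.length == 2 ? null : USAGE;
    }

    public static void main(String[] args) {
        // Class name arrives as an extra argument: three arguments reach the driver.
        System.out.println(checkArgs(new String[] {"WordCount", "shakespeare/poems", "myOutput"}));
        // Only the two paths arrive: the check passes and the job would be configured next.
        System.out.println(checkArgs(new String[] {"shakespeare/poems", "myOutput"}));
    }
}
```

With three arguments the check returns the usage message, which matches the symptom described in the question. With two arguments it returns null and the job setup would go ahead.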
Q: Can I use another tool or language to preprocess and clean the data files?
No. Everything needs to be done using MapReduce.
Q: How many files should I download?
Between 3 and 6 files inclusive (3 <= X <= 6, where X is the number of files).
Q: Can I build separate MapReduce processes to answer each part of the assignment?
No. You should create one chained MapReduce process for this assignment. Each component can create an output. This output may contain the answer required, as well as providing the input to the next job of the chained process.
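The idea of a chain, where each stage's output is both an answer and the input to the next stage, can be illustrated in plain Java. This is an in-memory sketch only, not Hadoop code and not the assignment solution: the two stages (word counting, then picking the most frequent word) and all names are made up for illustration. In Hadoop itself, a chain typically means the driver runs one job to completion and then uses that job's output directory as the next job's input path.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of a two-stage chain; no Hadoop classes are used.
final class ChainedJobsSketch {

    // Stage 1 (hypothetical): count how often each word occurs.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Stage 2 (hypothetical): read stage 1's output and pick the most frequent word.
    // Ties go to the word that comes first in sorted order.
    static String mostFrequent(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("to be or not to be", "to see");
        Map<String, Integer> stage1 = countWords(input); // an answer in its own right
        String stage2 = mostFrequent(stage1);            // next stage consumes stage 1's output
        System.out.println(stage1);
        System.out.println(stage2);
    }
}
```

The point of the sketch is the shape of the driver: stage 1's result is kept as an output and then handed to stage 2, rather than each part being run as an unrelated program.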
Q: I’m confused between what is a Project and what is a Site.
Check out this webpage: https://dumps.wikimedia.org/other/pagecounts-raw/
You can download the required files from there and process them accordingly.