Well, what is a cloud platform?
A platform you write data into without knowing where the data is going or where exactly it is stored. It is so dynamic that even the people running the cloud cannot readily say where a particular piece of data sits. The storage space grows according to need: if I load a few MBs of data today, a certain amount of space is allocated for me; if my company grows tomorrow and I need 100GB, that too is easily allocated from some pool of space called the cloud. It is called a cloud because nothing about it is specific.
What is Hadoop?
Is Hadoop a platform that gives us the functionality of a cloud architecture? Not exactly. Although it is frequently compared with the cloud, it is not quite a complete cloud. What it does provide is a way to store huge data files: a file can be torn apart and stored across many computers. That sounds simple enough, but the good part is that you can then operate on the data wherever it is stored. You feed in a large file, say 100GB, and you can forget about where it is stored and how to stitch it back together to operate upon it.
All you need to do is hand Hadoop a processing program along with the data file, and it will process the data and give out the result.
That looks like distributed computing. Yes, indeed, it is close to distributed computing. But remember that the processing program stays at a fixed size and is never split up; in Hadoop it is the data (say 100GB) that is split up, into pieces of, say, 500MB each, and operated upon where it sits, with some redundancy in the processing of each piece.
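To make this concrete, here is a minimal sketch of the classic word-count program written against Hadoop's MapReduce Java API (the class names are my own; only the Mapper and Reducer base classes come from Hadoop). The map function runs on each node against whichever piece of the big file happens to be stored there, and the reduce function merges the per-node results:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs wherever a piece of the file is stored and
// emits (word, 1) for every word it sees in its piece.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: gathers the counts for each word from all the nodes
// and adds them up into one total.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

Notice that the program itself is tiny; it is the data that gets split up and moved around.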
How does Hadoop work?
I am going to explain it in layman's language, with a small code sketch after the steps.
1) You feed in a big file.
2) The file gets split into small parts, which are sent to different computers (nodes) for storage, with multiple redundancy.
3) Then you send the program in.
4) The program is distributed to all the nodes and operates on the data stored there.
5) The results are then combined in sequence and given out.
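Those five steps map almost one-to-one onto how a job is actually submitted. Here is a minimal driver sketch (assuming the WordCountMapper and WordCountReducer classes from the earlier example, and input/output paths given on the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Step 3: sending the program in.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Steps 1-2 happened when the input file was copied into
        // Hadoop's file system and split across the nodes.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Steps 4-5: run on all the nodes, then collect the output.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}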
So there are two main controllers:
1) A manager for the stored data (in Hadoop's terms, the NameNode).
2) A manager for the programs (the JobTracker).
If these two controller nodes are started first, any number of worker nodes can then be started and plugged into the system while it is running. Plugging and unplugging a worker is almost as easy as pulling out its LAN cable and putting it back in again.
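As a small illustration of that dynamic membership (a sketch, assuming a running cluster whose configuration is on the classpath), you can ask the storage manager which worker nodes are currently plugged in, using HDFS's Java client API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Asks the storage manager (NameNode) which worker nodes are
// registered right now. Freshly plugged-in nodes show up here;
// unplugged ones eventually drop off the list.
public class ListWorkers {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(conf);
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            System.out.println(node.getHostName());
        }
    }
}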
Two other components are important for Hadoop to work: the splitter and the combiner. Splitter: splits the data into small pieces and sends them to the nodes. In fact, because of the redundancy, five pieces of data can be sent to ten computers; whichever computer finishes first gives back its result, and that result is accepted.
Combiner: has all the knowledge needed to combine the results and verify which ones are good. It picks up the fastest available correct result for each piece from the nodes and ignores all later duplicates; once all the parts are gathered, it gives out the final result.
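In Hadoop's own terminology, this "whichever computes first wins" behavior is called speculative execution, and it can be switched on or off per job. A minimal sketch (job setup details omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "speculative demo");

        // Run backup copies of slow tasks on other nodes; whichever
        // copy finishes first wins and the rest are discarded.
        job.setSpeculativeExecution(true);

        // ... set mapper, reducer, and paths as usual, then submit.
    }
}

One caution on names: Hadoop also has its own Combiner class (a mini-reducer that runs on each node before results are sent over the network), which is related to but narrower than the combiner described above.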
I would love to see lots of comments and clarifications on this topic. Let's learn about cloud and Hadoop while they are still in the making.
Hadoop is currently used in:
a) Managing Yahoo!'s cluster of about 2000 computing nodes.
b) Image processing and indexing: the New York Times has successfully used it to process its scanned newspaper archive.
c) Image processing across all the frames of a video, and building an index out of it.
d) In some form, in Microsoft's search engine (Bing).