Hadoop and Big Data: 2014

Sunday, 26 October 2014

Sense of Big Data

Big Data: What it Means to IT Managers on the Front Lines

Big Data: yet another “game-changer” IT pros must grapple with these days.

Companies like Google and Facebook are demonstrating that a solid data management strategy can make a huge difference to a company’s bottom line. Corporations everywhere are paying attention; C-level executives are increasingly using insights gained from analyzing Big Data to make business decisions. As a result, companies are promoting IT from cost center to partner in strategic data management.

The term Big Data refers to the vast amounts of unstructured data that result from people’s interactions with the Internet, social media and mobile apps. It’s the kind of data that doesn’t fit neatly into rows and columns with clear relationships on which simple queries and reports can be based.

More and more, IT managers on the front lines are actively participating in efforts to extract meaning from the Big Data companies collect and store. Therefore, IT managers would do well to learn all they can about Big Data and what can be done to help their company mold a solid data management strategy.

Making sense of Big Data

Examples of Big Data are videos, images, transactions, web pages, email, social media content, click-stream data, search indexes, sensor data, etc. – a wide variety of raw, semi-structured and unstructured data that can’t be processed and analyzed using traditional processes and tools, like relational databases.

But the term Big Data also refers to the volume and velocity of the data generated today. IBM, in its e-book Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, explains it this way: the interconnectivity of people and things via technology generates data continuously, and technology makes it possible to collect a massive amount of that data, but most of it isn’t relational and can’t be processed by traditional database systems. Moreover, much of it needs to be analyzed in real time. According to this definition, Big Data encompasses data at rest and data in motion.

So it’s no small wonder that Big Data is so unwieldy. The challenge is to formulate the right questions to extract meaning out of terabytes, even petabytes (and some day zettabytes!) of data — data organizations feel compelled to collect and store even though its value is not always immediately known.

For some companies, putting two and two together may be the only thing standing in the way of greatness.

Except making that connection is really hard. It’s expensive and time-consuming to use traditional database tools to analyze Big Data, and it’s not always possible – there might be too much data in too many different formats. Plus, there’s a steep learning curve when it comes to Big Data – new tools require new expertise.

Successfully Navigating Big Data

While analysis of Big Data has the potential to provide actionable insight that can generate financial windfalls for companies, if a compelling business case can’t be made to justify the project, it may be doomed from the start, says Jill Dyche, Vice-President of Thought Leadership at DataFlux Corporation, in a recent blog post. Dyche advises companies to think hard about the answers to these five questions when contemplating an investment in Big Data:

1. What are the goals of the project and what does the company want Big Data to help it accomplish?

2. What current resources can the company build on to develop a comprehensive data management strategy?

3. How will the company avoid scope-creep?

4. What are the criteria for success and how will progress be measured along the way?

5. Can the company manage the structural and process changes that will inevitably result?

If the company can answer these questions to its satisfaction, then chances are developing a solid data management strategy to deal with Big Data is worth it.

QUANTUM Global Academy is famous for providing training in Big Data and Hadoop, Cloud Computing, PMP, Six Sigma, ITIL, event management, retail management and logistics & supply chain in Gurgaon, Delhi.

We are here to help improve your company’s performance and productivity. For more information, visit our website www.quantumglobal.org or call us at 01244609530 or 08527092037.

Tuesday, 7 October 2014

Big data

Walmart started using big data even before the term became known in the industry. In 2012 they moved from an experimental 10-node Hadoop cluster to a 250-node Hadoop cluster. At the same time, they developed new tools to migrate their existing data on Oracle, Netezza and Greenplum hardware to their own systems.

The objective was to consolidate 10 different websites into one website and store all incoming data in the new Hadoop cluster. Since then they have made big steps in integrating big data into the DNA of Walmart.

Social big data solutions

Many of the big data tools have been developed at Walmart Labs, which was created after Walmart took over Kosmix in 2011. Some of the products developed at Walmart Labs are ‘Social Genome’, ‘ShoppyCat’ and ‘Get on the Shelf’.

The Social Genome product allows Walmart to reach customers, or friends of customers, who have mentioned something online to inform them about that exact product and include a discount. In order to do this they combine public data from the web, social data and proprietary data such as customer purchasing data and contact information.

This has resulted in a vast, constantly changing, up-to-date knowledge base with hundreds of millions of entities and relationships. It helps Walmart to better understand the context of what their customers are saying online.

An example mentioned by Walmart Labs shows a woman tweeting regularly about movies. When she tweets “I love Salt”, Walmart is able to understand that she is talking about the movie Salt and not the condiment.

Walmart came across several technical difficulties when developing the Social Genome, among them the quantity and velocity of the data pouring into their Hadoop clusters. As the regular MapReduce/Hadoop framework was not able to cope with the amount of data and the speed at which it arrived, they developed their own tool called Muppet.

This tool, now open source, processes the data in real time across all clusters and can perform several analyses at the same time.

The ShoppyCat product developed by Walmart recommends suitable products to Facebook users based on the hobbies and interests of their friends. It uses the Social Genome technology, among others, to help customers choose presents for their friends. An interesting aspect of this Facebook app is that Walmart will direct Facebook users to a different store if the product is sold out at a nearby Walmart store.


Monday, 11 August 2014

How important is Hadoop?

“Hadoop’s momentum is unstoppable as its open source roots grow wildly into enterprises. Its refreshingly unique approach to data management is transforming how companies store, process, analyze, and share big data.”

Hadoop Market

The graph above shows that the Hadoop market is on an upward trend. This is because organizations have realized the advantages of implementing Hadoop and its ecosystem. With wider implementation comes a greater need for Hadoop professionals.

Challenges Faced when Implementing Hadoop

According to Sand Hill Group’s survey on Hadoop in early October 2013, respondents felt that the skill gap and the shortage of professionals with Hadoop knowledge are the biggest obstacles to Hadoop implementation. Both issues need to be addressed as early as possible. The only way to do this is through proper Hadoop training, which enables professionals to acquire the required skills and knowledge.

Need for Hadoop Training:

The following are the reasons as to why one must go for Hadoop training:

New technology and fewer skilled professionals:

Hadoop's popularity is evidence that it has been implemented successfully. Although there is huge demand for people with Hadoop skills, few people have them, because Hadoop is a relatively new technology. Hadoop training can narrow this skill gap.

Increase in number of professionals looking to switch careers:

Professionals from Java, mainframe, data warehouse and testing backgrounds are willing to switch their careers to Hadoop, and the trend is not limited to these backgrounds. People are willing to leave their comfort zone and venture into new territory, namely Hadoop, to advance their careers. Hadoop has taken precedence over other technologies owing to its successful implementations.

Increased Demand for Hadoop Professionals after Hadoop 2.0:

The Apache Software Foundation recently released Hadoop 2.0, which incorporates several new features, including YARN. This release also shines a light on a major problem that companies considering Hadoop initiatives are destined to face: the overwhelming lack of Big Data expertise in today’s labour pool.

Increased opportunities for Home-grown Talent:

A lot of companies are looking to retrain their in-house talent pool in Hadoop. Professionals who take active interest in getting trained in Hadoop are given preference over others. Hadoop training gives you an edge over others in your professional growth.

Global Implementation of Hadoop

Online Training – A Practical way for Hadoop Training:

Owing to the busy schedules of IT professionals and the impracticality of attending in-class training, the best way to add skills is through online Hadoop training. The flexibility and comfort that come with online training are best suited to people who are busy with their jobs and also looking to add to their proficiency.

Here is a look at one of the class recordings on Hadoop. Online training does not compromise on the learning aspects: the sessions are interactive and give an in-class atmosphere even though participants are not physically present.

Thursday, 31 July 2014

How Hadoop Solves Big Data Problem

Big data is big in size. Exactly how much data can be classified as big data is not very clear cut, so let's not get bogged down in that debate. For a small company that is used to dealing with data in gigabytes, 10 TB of data would be BIG. However, for companies like Facebook and Yahoo, petabytes is big.

The sheer size of big data makes it impossible (or at least cost prohibitive) to store it in traditional storage like databases or conventional filers. Because storage cost grows with every gigabyte stored, using traditional storage filers for big data can cost a lot of money.

Big Data Is Unstructured or Semi-Structured

A lot of big data is unstructured. For example, click stream log data might look like:

time stamp, user_id, page, referrer_page

Lack of structure makes relational databases not well suited to store big data. Plus, not many databases can cope with storing billions of rows of data.
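As a rough sketch of what handling such semi-structured data involves, the Python snippet below parses lines in the four-field layout shown above. The sample values, the comma delimiter and the parse_click helper are assumptions made for illustration, not a real log format.

```python
from typing import NamedTuple, Optional


class Click(NamedTuple):
    timestamp: str
    user_id: str
    page: str
    referrer_page: str


def parse_click(line: str) -> Optional[Click]:
    """Parse one comma-separated log line; return None if it is malformed."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 4:
        return None  # no schema is enforced on the file, so bad lines must be tolerated
    return Click(*fields)


# Hypothetical sample lines, including one malformed record.
sample_lines = [
    "2014-07-31T10:15:00, u42, /products, /home",
    "2014-07-31T10:15:07, u17, /cart, /products",
    "garbled line without enough fields",
]

clicks = [c for c in map(parse_click, sample_lines) if c is not None]
for click in clicks:
    print(click.user_id, click.page)
```

Nothing stops a malformed line from ending up in the log, so the structure has to be imposed by the code that reads it rather than by the store that holds it.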

How Hadoop Solves the Big Data Problem

Hadoop is built to run on a cluster of machines.

Let’s start with an example. Let's say that we need to store lots of photos. We will start with a single disk. When we exceed a single disk, we may use a few disks stacked on a machine. When we max out all the disks on a single machine, we need to get a bunch of machines, each with a bunch of disks.

This is exactly how Hadoop is built. Hadoop is designed to run on a cluster of machines from the get go.

Hadoop clusters scale horizontally

More storage and compute power can be achieved by adding more nodes to a Hadoop cluster. This eliminates the need to buy more and more powerful and expensive hardware.
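To make the scaling claim concrete, the following back-of-the-envelope Python sketch estimates how usable storage grows as nodes are added. The node counts, disks per node and disk sizes are made-up figures. The replication factor of 3 is the HDFS default, and the estimate ignores space reserved for the operating system and temporary data.

```python
def usable_capacity_tb(nodes: int, disks_per_node: int, disk_tb: float,
                       replication: int = 3) -> float:
    """Raw cluster capacity divided by the number of copies kept of each block."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb / replication


# Hypothetical cluster: 12 disks of 2 TB in every node.
for nodes in (10, 20, 40):
    print(nodes, "nodes ->", usable_capacity_tb(nodes, 12, 2.0), "TB usable")
```

Under these assumptions, doubling the number of nodes doubles the usable capacity, with no change to the hardware in any single machine.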

Hadoop can handle unstructured/semi-structured data

Hadoop doesn't enforce a schema on the data it stores. It can handle arbitrary text and binary data. So Hadoop can digest any unstructured data easily.

Hadoop clusters provide storage and computing

Having separate storage and processing clusters is not the best fit for big data. Hadoop clusters, however, provide storage and distributed computing all in one.

The Business Case for Hadoop

Hadoop provides storage for big data at reasonable cost

Storing big data using traditional storage can be expensive. Hadoop is built around commodity hardware, so it can provide fairly large storage for a reasonable cost. Hadoop has been used in the field at petabyte scale.

One study by Cloudera suggested that enterprises usually spend around $25,000 to $50,000 per terabyte per year. With Hadoop, this cost drops to a few thousand dollars per terabyte per year. As hardware gets cheaper and cheaper, this cost continues to drop.
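The arithmetic behind that comparison can be sketched in a few lines of Python. The traditional prices are the two ends of the range quoted above. The $2,000 figure is only an assumed stand-in for "a few thousand dollars", and the 100 TB data volume is likewise chosen purely for illustration.

```python
def yearly_cost(terabytes: float, cost_per_tb: float) -> float:
    """Yearly storage cost in dollars for a given volume and unit price."""
    return terabytes * cost_per_tb


DATA_TB = 100              # assumed data volume
TRADITIONAL_LOW = 25_000   # low end of the quoted range, $ per TB per year
TRADITIONAL_HIGH = 50_000  # high end of the quoted range, $ per TB per year
HADOOP_ASSUMED = 2_000     # assumed stand-in for "a few thousand", $ per TB per year

low = yearly_cost(DATA_TB, TRADITIONAL_LOW)
high = yearly_cost(DATA_TB, TRADITIONAL_HIGH)
hadoop = yearly_cost(DATA_TB, HADOOP_ASSUMED)

print(f"Traditional storage: ${low:,.0f} to ${high:,.0f} per year")
print(f"Hadoop (assumed):    ${hadoop:,.0f} per year")
```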

Hadoop allows for the capture of new or more data

Sometimes organizations don't capture a type of data because it is too cost prohibitive to store. Since Hadoop provides storage at reasonable cost, this type of data can be captured and stored.

One example would be website click logs. Because the volume of these logs can be very high, not many organizations captured these. Now with Hadoop it is possible to capture and store the logs.

With Hadoop, you can store data longer

To manage the volume of data stored, companies periodically purge older data. For example, only logs for the last three months might be kept, while older logs are deleted. With Hadoop it is possible to store the historical data longer. This allows new analytics to be done on older historical data.

For example, take click logs from a website. A few years ago, these logs were stored for a brief period of time to calculate statistics like popular pages. Now with Hadoop, it is viable to store these click logs for a longer period of time.

Hadoop provides scalable analytics

There is no point in storing all this data if we can't analyze it. Hadoop provides not only distributed storage but also distributed processing, which means we can crunch a large volume of data in parallel. The compute framework of Hadoop is called MapReduce, and it has been proven at petabyte scale.
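The map, shuffle and reduce phases can be illustrated with a small single-process sketch in plain Python. This is not Hadoop's API. It only mimics the data flow on a handful of made-up click records to count page views, the kind of "popular pages" statistic mentioned earlier. In a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Emit (page, 1) for each click record of the form 'user_id,page'."""
    _user_id, page = record.split(",")
    yield page, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one page."""
    return key, sum(values)


# Hypothetical input records.
records = ["u1,/home", "u2,/home", "u1,/cart", "u3,/home"]

mapped = [pair for record in records for pair in map_phase(record)]
page_views = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(page_views)
```

Because each map call looks at one record and each reduce call looks at one key, the work can be split across machines without the pieces needing to talk to each other.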

Hadoop provides rich analytics

Native MapReduce supports Java as its primary programming language. Other languages like Ruby, Python and R can be used as well.
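Non-Java languages are typically plugged in through Hadoop Streaming, which feeds records to a script on standard input and reads tab-separated key/value lines from its standard output. A minimal sketch of a Python mapper and reducer in that style follows. The 'user_id,page' input layout is an assumption, and the functions take plain iterables so they can be tried without a cluster.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'page<TAB>1' for every input line of the form 'user_id,page'."""
    for line in lines:
        _user_id, page = line.strip().split(",")
        yield f"{page}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per key; the input must already be sorted by key."""
    keyed = (line.split("\t") for line in sorted_lines)
    for page, group in groupby(keyed, key=lambda kv: kv[0]):
        total = sum(int(count) for _page, count in group)
        yield f"{page}\t{total}"


# Local dry run: sorted() stands in for the sort the framework performs
# between the map and reduce steps.
sample = ["u1,/home", "u2,/cart", "u3,/home"]
result = list(reducer(sorted(mapper(sample))))
print(result)
```

In an actual streaming job, the mapper and reducer would each live in their own script and iterate over sys.stdin instead of an in-memory list.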

Of course, writing custom MapReduce code is not the only way to analyze data in Hadoop. Higher-level tools built on MapReduce are available. For example, a tool named Pig takes an English-like data flow language and translates it into MapReduce. Another tool, Hive, takes SQL queries and runs them using MapReduce. Business intelligence (BI) tools can provide an even higher level of analysis, and there are tools for this type of analysis as well.
