Special topics in computing big data

1. GROUP 3 MEMBERS
FELIX OTIENO, MWENDWA KASONGO, GEOFFREY LANGAT, NICHOLAS LANGAT, JULIET DINDI, CLEOPHAS KANYOGO

2. INTRODUCTION
Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

3. OVERVIEW
Big data can be characterized in terms of volume, velocity, variety, and value. The volume of data has increased enormously: terabytes of data are generated daily. This data travels with increasing velocity: it is generated in real time, and real-time analysis is required. In addition, more varieties of data are generated (for example, from sources such as social media, equipment sensors, and e-commerce). With an enormous amount of unstructured data, it is harder to derive value from the data, because important information can be hidden among irrelevant data. The biggest challenge is to identify valuable data and then to transform and extract that data so that you can analyze it.

4. HISTORY
Facebook, Yahoo and Google found themselves collecting data on an unprecedented scale, referred to as Big Data. These collections quickly overwhelmed traditional data systems and techniques such as MySQL. In the early 2000s, armies of PhDs developed new techniques like BigTable, MapReduce and Google File System to handle this big data. Today, companies in every industry find themselves with big data problems brought on by their daily increasing ability to collect information.

5. APPLICATIONS OF BIG DATA
Business
Healthcare
Research

6. Application of big data in business
Big Data Exploration: Find, visualize and understand all big data to improve decision making. Big data exploration addresses the challenge that every large organization faces: information is stored in many different systems and silos, and people need access to that data to do their day-to-day work and make important decisions.
Enhanced 360 View of the Customer: Extend existing customer views by incorporating additional internal and external information sources.
Gain a full understanding of customers: what makes them tick, why they buy, how they prefer to shop, why they switch, what they'll buy next, and what factors lead them to recommend a company to others.

7. Application of big data in business
Security Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real time. Augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new types of data (e.g. social media, emails, sensors, telco) and sources of underleveraged data to significantly improve intelligence, security and law enforcement insight.
Operations Analysis: Analyze a variety of machine and operational data for improved business results. The abundance and growth of machine data, which can include anything from IT machines to sensors, meters and GPS devices, requires complex analysis and correlation across different types of data sets. By using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions and behavior.

8. Applications of big data in business
Data Warehouse Modernization: Integrate big data and data warehouse capabilities to increase operational efficiency. Optimize your data warehouse to enable new types of analysis. Use big data technologies to set up a staging area or landing zone for your new data before determining what data should be moved to the data warehouse. Offload infrequently accessed or aged data from warehouse and application databases using information integration software and tools.

9. Application of big data in healthcare
The chief appeal of Big Data in healthcare lies in two distinct areas. First, the sifting of vast amounts of data to discover trends and patterns within them helps direct the course of treatments, generate new research, and focus on causes that were thus far unclear.
In other words, the lens used to view data has become much wider. Second, the sheer volume of data that can be processed using Big Data techniques is an enabler for fields such as drug discovery and molecular medicine.
Big Data solutions in healthcare are typically seen in three areas: advanced analytics such as personalized medicine, drug development and epidemiology, which require massive amounts of data and complex data mining algorithms; real-time applications calling for complex event processing (e.g., patient monitoring, proactive risk management), which require analysis of numerous real-time data streams; and unstructured data mining, such as keeping practitioners abreast of medical literature more effectively and efficiently, and uncovering patterns in text, images, audio, and video.

10. Applications of big data in health
Big Data can enable new types of applications which in the past might not have been feasible due to scalability or cost constraints. In the past, scalability was often limited by symmetric multiprocessing (SMP) environments, where a single machine can only be scaled up so far by adding more processors, memory, or disk. Massively parallel processing (MPP), on the other hand, enables nearly limitless scalability by spreading work across many machines. Many Big Data platforms such as Hadoop and Cassandra are open source software that can run on commodity hardware, thus driving down hardware and software costs.

11. Application of big data in research
As we move toward an era in which more and more information is digitised, the overall amount of data grows exponentially.
The research area of Big Data has emerged precisely to tackle the vast amounts of data generated, as well as to investigate the strong societal impact of the explosion of data.
Supporting Big Data Analytics: Big Data as a concept refers to data that is so large in volume, moving at such an unprecedented velocity, with such a high variation in structure, and very often of such uncertain veracity that new techniques and systems must be developed to fully exploit and explore it and to derive its value. The figure to the left depicts a generic architecture for how data from different sources, including social media, medical records, DNA sequences and communication data, is typically dealt with. The architecture consists of three main steps: data collection, processing and analysis, and value extraction. The ultimate research goal of our team is to develop a computational and statistical framework that can help us extract value from vast amounts of data.

12. CHALLENGES OF BIG DATA
Data challenges
Volume: the main challenge is how to deal with the size of big data.
Variety (combining multiple data sets): the challenge is how to handle the multiplicity of types, sources and formats.
Velocity: one of the key challenges is how to react to the flood of information in the time required by the application.
Veracity (data quality, data availability): How can we cope with uncertainty, imprecision, missing values, misstatements or untruths? How good is the data? How broad is the coverage? How fine is the sampling resolution? How timely are the readings? How well understood are the sampling biases? Is there data available at all?

13. CHALLENGES OF BIG DATA (CONTINUED)
Data discovery: this is a huge challenge. How do we find high-quality data in the vast collections of data that are out there on the Web?
Quality and relevance: the challenge is determining the quality of data sets and their relevance to particular issues (i.e. is the data set making some underlying assumption that renders it biased or not informative for a particular question?).
Data comprehensiveness: are there areas without coverage? What are the implications?
Personally identifiable information: can we extract enough information to help people without extracting so much as to compromise their privacy?

14. Process challenges
A major challenge in this context is how to analyse data. Process challenges in regard to deriving insights include:
Capturing data
Aligning data from different sources (e.g., resolving when two objects are the same)
Transforming the data into a form suitable for analysis
Modelling it, whether mathematically or through some form of simulation
Understanding the output, visualizing and sharing the results, and considering how to display complex analytics on a mobile device

15. Management challenges
The main management challenges relate to data privacy, security, governance, and ethical issues. They include ensuring that data is used correctly, which means abiding by its intended uses and relevant laws; tracking how the data is used, transformed and derived; and managing its lifecycle.
According to Michael Blaha, many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data, so the data must be secured and access controlled, and access must be logged for audits.

16. TECHNOLOGIES OF BIG DATA
APACHE HADOOP

17. HADOOP
Software framework that supports distributed applications, licensed under the Apache v2 licence.
Derived from Google's MapReduce and Google File System papers.
Yahoo is the largest contributor to the project.
Written in the Java programming language.
Based on a file system; it is not a database.

18. Hadoop
Hadoop is a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure.
Hadoop has two core components, with Hive commonly used on top of them:
The Hadoop Distributed File System (HDFS), which supports data in structured relational form, in unstructured form, and in any form in between
The MapReduce programming paradigm for managing applications on multiple distributed servers
Hive: a layer on top of Hadoop that provides HiveQL, a query language similar to SQL

19. Why use Hadoop?
Need to process a lot of data (petabyte scale)
Need to parallelize processing across a multitude of CPUs
Gives scalability with low-cost commodity hardware
Open source

20. Companies using Hadoop
eBay, Facebook, Microsoft, Twitter, Amazon.com, IBM, Last.fm, New York Times

21. Common Uses
Searches
Log processing
Recommendation systems
Analytics
Video processing (NASA)

22. Hadoop Components
HDFS
MapReduce

23. HDFS
The Hadoop Distributed File System is a distributed file system.
Each node in a Hadoop instance has a single data node.
Achieves reliability by replicating data across multiple hosts, so that it can handle hardware failure.
Data nodes can communicate with each other.
HDFS splits input data into blocks (64 MB or 128 MB).

24. HDFS BLOCKS

25. Map Reduce
Coordinated by a job tracker, which assigns tasks to idle task tracker nodes in the cluster.
The map function is applied in parallel to every key/value pair in the input dataset and produces a list of pairs for each call:
Map(key1, value1) -> list(key2, value2)

26. Map reduce Example

27. Hive
Built on top of Hadoop to provide data summarization, query and analysis.
Provides an SQL-like language called HiveQL.
Supports SELECT, JOIN, GROUP BY, etc.
Example: SELECT yearofpublication, COUNT(booktitle) FROM bxdataset GROUP BY yearofpublication;

28. THE FUTURE OF BIG DATA

29. Future of Big Data
In the future we'll be able to store and process more data than we can now. Beyond storing more data, we will be able to store it in many more ways. Hadoop will get better. More things will get integrated, and that trend will continue.
More and more data will move out of silo systems and into central systems that provide a variety of tools running on a variety of datasets, essentially an enterprise data hub. The enterprises that do best will be those that best leverage their technology.
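
Appendix sketch: the Map Reduce slide (25) gives the signature Map(key1, value1) -> list(key2, value2). As an illustration only, the plain-Python sketch below imitates that flow on a single machine and computes the same per-year book count as the HiveQL example on the Hive slide (27). It is not Hadoop code: the sample records and the function names map_book, shuffle, reduce_counts and run_job are hypothetical, chosen just for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical sample records: (book title, year of publication).
BOOKS: List[Tuple[str, int]] = [
    ("Book A", 1999),
    ("Book B", 2001),
    ("Book C", 1999),
    ("Book D", 2001),
    ("Book E", 2001),
]


def map_book(title: str, year: int) -> List[Tuple[int, int]]:
    """Map(key1, value1) -> list(key2, value2): emit one (year, 1) pair per book."""
    return [(year, 1)]


def shuffle(pairs: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Group all emitted values by key, as a framework would between map and reduce."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_counts(year: int, counts: List[int]) -> Tuple[int, int]:
    """Reduce step: sum the 1s emitted for a given year."""
    return year, sum(counts)


def run_job(records: Iterable[Tuple[str, int]]) -> Dict[int, int]:
    """Run map, shuffle and reduce sequentially (a real cluster runs map calls in parallel)."""
    mapped = [pair for title, year in records for pair in map_book(title, year)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(year, counts) for year, counts in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(BOOKS))  # {1999: 2, 2001: 3}
```

Because each map_book call depends only on its own record, the calls could be distributed across many machines, which is the property the Map Reduce slide relies on.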

