Thursday, September 5, 2019

Data Anonymization in Cloud Computing

A Data Anonymization Approach for Privacy Preservation in the Cloud

Saranya M

Abstract—Private data such as electronic health records and banking transactions must be shared within the cloud environment so that the data can be analyzed or mined for research purposes. Data privacy is one of the most pressing concerns in big data applications, because processing large-scale sensitive data sets often requires the computation power of public cloud services. With a technique called data anonymization, the privacy of an individual can be preserved while aggregate information is shared for mining purposes. Data anonymization hides the sensitive data items of the data owner. Bottom-up generalization transforms more specific data into less specific but semantically consistent data for privacy protection. The idea is to use data generalization from data mining to hide detailed data rather than to discover patterns. Once the data is masked, data mining techniques can be applied without modification.

Keywords—Data Anonymization; Cloud; Bottom-Up Generalization; MapReduce; Privacy Preservation.

I. INTRODUCTION

Cloud computing refers to configuring, manipulating, and accessing applications online. It provides online data storage, infrastructure, and applications, and it is a disruptive trend with a significant impact on the current IT industry and research communities [1]. Cloud computing provides massive storage capacity and computation power by harnessing a large number of commodity computers together, enabling users to deploy applications at low cost and without a large upfront investment in infrastructure. Due to privacy and security problems, numerous potential customers are still hesitant to take advantage of the cloud [7]. Cloud computing can reduce costs through optimization and increased operating and economic efficiency, and it can enhance collaboration, agility, and scale by enabling a global computing model over the Internet infrastructure. However, without proper security and privacy solutions, this promising computing paradigm could become a huge failure.

Cloud delivery models are classified into three kinds: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). SaaS is very similar to the old thin-client model of software provision, in which clients, usually web browsers, provide the point of access to software running on servers. PaaS provides a platform on which software can be developed and deployed. IaaS comprises highly automated and scalable compute resources, complemented by cloud storage and network capability, which can be metered, self-provisioned, and made available on demand [7].

Clouds are deployed using models that include public, private, and hybrid clouds. A public cloud is one in which the services and infrastructure are provided off-site over the Internet. A private cloud is one in which the services and infrastructure are maintained on a private network; such clouds offer a greater level of security. A hybrid cloud combines public and private options, possibly from multiple providers.

Big data environments require clusters of servers to support the tools that process large volumes of high-velocity data in varied formats. Clouds are deployed on pools of server, storage, and networking resources and can scale up or down as needed, which makes them a convenient fit.
Cloud computing provides a cost-effective way to support big data techniques and the advanced applications that drive business value. Big data analytics is a set of advanced technologies designed to work with large volumes of data. It uses quantitative methods such as computational mathematics, machine learning, robotics, neural networks, and artificial intelligence to explore the data in the cloud.

Analyzing big data on cloud infrastructure makes sense because investments in big data analysis can be significant and drive a need for efficient, cost-effective infrastructure; because big data combines internal and external sources; and because data services are needed to extract value from big data [17].

To address the scalability problem for large-scale data sets, a widely adopted parallel data processing framework such as MapReduce is used. In the first phase, the original data set is partitioned into a group of smaller data sets, which are anonymized in parallel to produce intermediate results. In the second phase, the intermediate results are integrated into one and further anonymized to achieve a consistent k-anonymous data set.

MapReduce is a programming and implementation model for processing and generating large data sets. A map function processes a key-value pair and generates a set of intermediate key-value pairs; a reduce function merges all intermediate values associated with the same intermediate key.
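To make the model concrete, the following minimal Hadoop-style sketch shows a map function that emits a record's quasi-identifier combination as the intermediate key, and a reduce function that counts the records sharing that key — the group size a k-anonymity check needs. The class names and the choice of quasi-identifier columns are illustrative assumptions, not taken from the paper.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: parse one CSV record per line and emit its quasi-identifier
    // attributes (here, columns 0 and 1, e.g. age and zip code) as the
    // intermediate key, with a count of 1 as the intermediate value.
    public class QidGroupMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text qid = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            qid.set(fields[0] + "," + fields[1]); // quasi-identifier columns
            ctx.write(qid, ONE);
        }
    }

    // Reduce: merge all intermediate values that share the same key,
    // i.e. count how many records fall into each quasi-identifier group.
    class QidGroupReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text qid, Iterable<IntWritable> counts,
                              Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(qid, new IntWritable(sum)); // size of this group
        }
    }

A group whose size falls below k violates the k-anonymity requirement and must be generalized further before release.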
II. RELATED WORK

Ke Wang, Philip S. Yu, and Sourav Chakraborty adopt a bottom-up generalization approach that works iteratively to generalize the data. The generalized data remain useful for classification but become difficult to link to other sources. A hierarchical structure of generalizations specifies the generalization space, and identifying the best generalization with which to climb the hierarchy at each iteration is the key [2].

Benjamin C. M. Fung and Ke Wang discuss how privacy-preserving technology solves only part of the problem; it is equally important to identify and overcome the nontechnical difficulties faced by decision makers when deploying a privacy-preserving technology. Their concerns include the degradation of data quality, increased costs, increased complexity, and loss of valuable information. They see cross-disciplinary research as the key to removing these obstacles and urge scientists in the privacy protection field to collaborate with social scientists in sociology, psychology, and public policy studies [3].

Jiuyong Li, Jixue Liu, Muzammil Baig, and Raymond Chi-Wing Wong propose two classification-aware data anonymization methods that combine local value suppression with global attribute generalization. The attribute generalization is determined by the data distribution instead of the privacy requirement, and generalization levels are optimized by normalized mutual information to preserve classification capability [17].

Xiaokui Xiao and Yufei Tao present a technique called anatomy for publishing sensitive data sets. Anatomy releases all the quasi-identifier and sensitive values directly in two separate tables. Combined with a grouping mechanism, this approach protects privacy while capturing a large amount of correlation in the microdata. They develop a linear-time algorithm for computing anatomized tables that obey the l-diversity privacy requirement while minimizing the error of reconstructing the microdata [13].

III. PROBLEM ANALYSIS

Centralized Top-Down Specialization (TDS) approaches exploit data structures to improve scalability and efficiency by indexing anonymous data records, but overhead is incurred in maintaining the linkage structure and updating the statistics as data sets become large. Centralized approaches therefore suffer from low efficiency and poor scalability when handling large-scale data sets. A distributed TDS approach has been proposed to address anonymization in distributed systems, but it concentrates on privacy protection rather than scalability, and it employs information gain only, not privacy loss [1].

Indexing data structures speed up anonymization and generalization because they avoid repeatedly scanning the whole data set [15]. These approaches fail, however, in parallel or distributed environments such as cloud systems, since the indexing structures are centralized. Centralized approaches cannot handle large-scale data sets well on the cloud using a single VM, even if that VM has the highest computation and storage capability.

Fung et al. propose a TDS approach that produces an anonymized data set while addressing the exploration problem on the data. A data structure, Taxonomy Indexed PartitionS (TIPS), is exploited to improve the efficiency of TDS, but the approach is centralized, leading to inadequacy on large data sets.

Raj H, Nathuji R, Singh A, and England P propose cache-hierarchy-aware core assignment and page-coloring-based cache partitioning to provide resource isolation and better resource management, thereby guaranteeing the security of data during processing. However, the page-coloring approach degrades performance when a VM's working set does not fit in its cache partition [14].

Ke Wang and Philip S. Yu consider the following problem: a data holder needs to release a version of the data for building classification models while protecting sensitive information against external sources. We address this problem by adapting their iterative bottom-up generalization approach to generalize the data.

IV. METHODOLOGY

Two basic anonymization operators are used; a small code sketch at the end of this section illustrates both.

Suppression: certain values of the attributes are replaced by an asterisk (*). All or some values of a column may be replaced by *.

Generalization: individual values of attributes are replaced by a broader category. For example, the value 19 of the attribute Age may be replaced by "<20" and the value 23 by "20-29".

A. Bottom-Up Generalization

Bottom-Up Generalization (BUG) is one of the efficient k-anonymization methods. Under k-anonymity, attributes are suppressed or generalized until each row is identical to at least k-1 other rows, at which point the database is said to be k-anonymous. BUG starts from the lowest anonymization level and proceeds iteratively, leveraging the information/privacy trade-off as the search metric. Bottom-Up Generalization and a MapReduce Bottom-Up Generalization (MRBUG) driver are used. The Advanced BUG consists of the following steps: partition the data set, run the MRBUG driver on each partition, combine the anonymization levels of the partitioned data items, and apply the resulting generalization to the original data set without violating k-anonymity.
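The following self-contained sketch illustrates the suppression and generalization operators and the k-anonymity test described above. The attribute layout, interval boundaries, and the value of k are assumptions made for illustration only.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: suppression, generalization, and a
    // k-anonymity check over a small table of (age, zip) records.
    public class AnonymizationSketch {

        // Suppression: replace a value entirely with an asterisk.
        static String suppress(String value) {
            return "*";
        }

        // Generalization: replace an exact age with a broader
        // interval, e.g. 19 -> "<20", 23 -> "20-29".
        static String generalizeAge(int age) {
            if (age < 20) return "<20";
            int lo = (age / 10) * 10;
            return lo + "-" + (lo + 9);
        }

        // k-anonymity check: every quasi-identifier combination must
        // occur in at least k rows of the released table.
        static boolean isKAnonymous(String[][] table, int k) {
            Map<String, Integer> groupSizes = new HashMap<>();
            for (String[] row : table) {
                groupSizes.merge(String.join(",", row), 1, Integer::sum);
            }
            for (int size : groupSizes.values())
                if (size < k) return false;
            return true;
        }

        public static void main(String[] args) {
            int[] ages = {19, 23, 27, 18};
            String[] zips = {"54321", "54322", "54321", "54399"};
            String[][] released = new String[ages.length][2];
            for (int i = 0; i < ages.length; i++) {
                released[i][0] = generalizeAge(ages[i]); // generalized age
                released[i][1] = suppress(zips[i]);      // suppressed zip
            }
            System.out.println("2-anonymous? " + isKAnonymous(released, 2));
        }
    }

Here the generalized ages form two groups of size two each, so the released table satisfies 2-anonymity; with k = 3 the check would fail and another round of generalization would be required.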
Fig. 1. System architecture of the bottom-up approach.

The Advanced Bottom-Up Generalization approach improves the scalability and performance of BUG through two levels of parallelization provided by MapReduce (MR) on the cloud. The first is job-level parallelization, meaning that multiple MR jobs can execute simultaneously to make full use of the cloud infrastructure. The second is task-level parallelization, meaning that multiple mapper and reducer tasks within an MR job execute simultaneously over data partitions. Our approach performs the following steps: first, the data set is split into smaller data sets using job-level MapReduce; the partitioned data sets are then anonymized by the Bottom-Up Generalization driver; the resulting intermediate anonymization levels are integrated into one, ensuring that the integrated intermediate level never violates the k-anonymity property; finally, the driver is executed on the original data set with the merged intermediate anonymization level to produce the resultant anonymization level. The algorithm for Advanced Bottom-Up Generalization [15] is given below; in the ith iteration, it generalizes R by the best generalization Gbest, where IP(G) denotes the search metric of a candidate generalization G.

    1: while R does not satisfy the anonymity requirement do
    2:   for all generalizations G do
    3:     compute IP(G);
    4:   end for
    5:   find the best generalization Gbest;
    6:   generalize R through Gbest;
    7: end while
    8: output R;

B. MapReduce

The MapReduce framework is built on two functions: map and reduce. Map is a function that parcels out work to the different nodes in the distributed cluster; reduce is a function that collates the work and resolves the results into a single value.

Fig. 2. MapReduce framework.

The MR framework is fault-tolerant because each node in the cluster is expected to report back periodically with status updates and completed work. If a node remains silent for longer than expected, a master node notes this and reassigns its task to other nodes. A single MR job is inadequate to accomplish the whole task, so a group of MR jobs is orchestrated in one MR driver. The framework consists of the MR driver and two types of jobs: IGPL Initialization and IGPL Update. The MR driver arranges the execution of the jobs. Hadoop provides a mechanism for setting global variables for the mappers and reducers; the best specialization is passed into the map function of the IGPL Update job this way. In the bottom-up approach, the data is first initialized to its current state, and generalizations are then carried out so long as k-anonymity is not violated; that is, we climb the taxonomy tree of each attribute until the required anonymity is achieved.
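The global-variable mechanism can be made concrete with a small sketch: in Hadoop, a value such as the currently selected best generalization can be placed in the job Configuration by the driver and read back inside each mapper. The job name, property name, and mapper logic below are illustrative assumptions, not the paper's actual IGPL implementation.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    // Driver side: publish the best generalization chosen in the
    // previous iteration as a job-wide value via the Configuration.
    public class IgplUpdateDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical property name and value for illustration.
            conf.set("bug.best.generalization", "Age:20-29");
            Job job = Job.getInstance(conf, "IGPL Update (illustrative)");
            job.setJarByClass(IgplUpdateDriver.class);
            job.setMapperClass(UpdateMapper.class);
            // ... input/output paths and formats would be set here ...
            job.waitForCompletion(true);
        }
    }

    // Mapper side: read the driver-supplied value back in setup() and
    // use it for every record of the partition.
    class UpdateMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private String bestGeneralization;

        @Override
        protected void setup(Context ctx) {
            bestGeneralization =
                ctx.getConfiguration().get("bug.best.generalization");
        }

        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            // A real implementation would rewrite the record's attribute
            // according to bestGeneralization before emitting it; this
            // sketch passes the record through unchanged.
            ctx.write(new Text(record.toString()), NullWritable.get());
        }
    }

Because the Configuration is distributed to every task, each mapper of the update job sees the same generalization without any shared mutable state between nodes.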
V. EXPERIMENT EVALUATION

The goal is to use data generalization from data mining to hide detailed information rather than to discover patterns and trends. Once the data has been masked, all standard data mining techniques can be applied without modification; the technique therefore not only supports the discovery of useful patterns but also masks private information.

Fig. 3. Change of execution time of TDS and BUG.

Fig. 3 shows the change in execution time of the TDS and BUG algorithms. We compared the execution time of TDS and BUG for EHR data sets ranging from 50 to 500 MB, keeping p = 1, with bottom-up generalization transforming specific data into less specific data. The experiments focus on two key issues, quality and scalability: quality is addressed by trading off information and privacy in the bottom-up generalization approach, while scalability is addressed by a novel data structure that focuses the generalizations. To evaluate the efficiency and effectiveness of the BUG approach, we compare BUG with TDS. The experiments were performed in a cloud environment, with both approaches implemented in Java on the standard Hadoop MapReduce API.

VI. CONCLUSION

We studied the scalability problem of anonymizing data on the cloud for big data applications and proposed a scalable Bottom-Up Generalization approach. The BUG approach proceeds as follows: first the data is partitioned and the driver is executed to produce intermediate results; these results are then merged into one, and generalization is applied to produce the anonymized data. The anonymization is performed with the MapReduce framework on the cloud. The results show that scalability and efficiency are improved significantly over existing approaches.

REFERENCES

[1] X. Zhang, L.T. Yang, C. Liu, and J. Chen, "A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud," IEEE Trans. Parallel and Distributed Systems, vol. 25, no. 2, Feb. 2014.
[2] K. Wang, P.S. Yu, and S. Chakraborty, "Bottom-Up Generalization: A Data Mining Solution to Privacy Protection," Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM '04), 2004.
[3] B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys, vol. 42, no. 4, pp. 1-53, 2010.
[4] K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, "Workload-Aware Anonymization Techniques for Large-Scale Datasets," ACM Trans. Database Systems, vol. 33, no. 3, pp. 1-47, 2008.
[5] B.C.M. Fung, K. Wang, L. Wang, and P.C.K. Hung, "Privacy-Preserving Data Publishing for Cluster Analysis," Data & Knowledge Engineering, vol. 68, no. 6, pp. 552-575, 2009.
[6] B.C.M. Fung, K. Wang, and P.S. Yu, "Anonymizing Classification Data for Privacy Preservation," IEEE Trans. Knowledge and Data Engineering, vol. 19, no. 5, pp. 711-725, May 2007.
[7] H. Takabi, J.B.D. Joshi, and G.-J. Ahn, "Security and Privacy Challenges in Cloud Computing Environments," IEEE Security & Privacy, vol. 8, no. 6, 2010.
[8] K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, "Incognito: Efficient Full-Domain K-Anonymity," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), pp. 49-60, 2005.
[9] T. Iwuchukwu and J.F. Naughton, "K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 746-757, 2007.
[10] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[11] J. Dean and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," Comm. ACM, vol. 53, no. 1, pp. 72-77, 2010, doi:10.1145/1629175.1629198.
[12] J. Li, J. Liu, M. Baig, and R.C.-W. Wong, "Information Based Data Anonymization for Classification Utility."
[13] X. Xiao and Y. Tao, "Anatomy: Simple and Effective Privacy Preservation," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB '06), pp. 139-150, 2006.
[14] H. Raj, R. Nathuji, A. Singh, and P. England, "Resource Management for Isolation Enhanced Cloud Services," Proc. 2009 ACM Workshop on Cloud Computing Security, Chicago, Illinois, USA, pp. 77-84, 2009.
â€Å"An Advanced Bottom up  Generalization Approach for Big Data on Cloud† , Volume: 03, June  2014, Pages: 1054-1059.. [16] Intel â€Å"Big Data in the Cloud: Converging Technologies†. [17] Jiuyong Li, Jixue Liu Muzammil Baig, Raymond Chi-Wing Wong,  Ã¢â‚¬Å"Information based data anonymization for classification utility†.
