Thursday, September 5, 2019

Data Anonymization in Cloud Computing

A Data Anonymization Approach for Privacy Preservation in the Cloud

Saranya M

Abstract—Private data such as electronic health records and banking transactions must be shared within the cloud environment so that the data can be analyzed or mined for research purposes. Data privacy is one of the most pressing concerns in big data applications, because processing large-scale sensitive data sets often requires the computation power of public cloud services. With a technique called data anonymization, the privacy of an individual can be preserved while aggregate information is shared for mining purposes. Data anonymization hides the sensitive data items of the data owner. Bottom-up generalization transforms more specific data into less specific but semantically consistent data for privacy protection. The idea is to use data generalization from data mining to hide detailed data rather than to discover patterns. Once the data is masked, data mining techniques can be applied without modification.

Keywords—Data Anonymization; Cloud; Bottom-Up Generalization; MapReduce; Privacy Preservation.

I. INTRODUCTION

Cloud computing refers to configuring, manipulating, and accessing applications online. It provides online data storage, infrastructure, and applications, and it is a disruptive trend with a significant impact on the current IT industry and research communities [1]. Cloud computing provides massive storage capacity and computation power by harnessing a large number of commodity computers together, enabling users to deploy applications at low cost and without a large upfront investment in infrastructure. Due to privacy and security problems, numerous potential customers are still hesitant to take advantage of the cloud [7]. Cloud computing can reduce costs through optimization and increased operating and economic efficiency, and it can enhance collaboration, agility, and scale by enabling a global computing model over the Internet infrastructure. However, without proper security and privacy solutions, this promising computing paradigm could become a huge failure.

Cloud delivery models are classified into three kinds: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). SaaS is very similar to the old thin-client model of software provision, in which clients, usually web browsers, provide the point of access to software running on servers. PaaS provides a platform on which software can be developed and deployed. IaaS comprises highly automated and scalable compute resources, complemented by cloud storage and network capability, which can be metered, self-provisioned, and made available on demand [7].

Clouds are deployed using models that include public, private, and hybrid clouds. A public cloud is one in which the services and infrastructure are provided off-site over the Internet. A private cloud is one in which the services and infrastructure are maintained on a private network; such clouds offer a greater level of security. A hybrid cloud combines public and private options, possibly from multiple providers.

Big data environments require clusters of servers to support the tools that process large volumes of high-velocity data in varied formats. Clouds are deployed on pools of server, storage, and networking resources and can scale up or down as needed, which makes them a convenient fit.
Cloud computing provides a cost-effective way to support big data techniques and the advanced applications that drive business value. Big data analytics is a set of advanced technologies designed to work with large volumes of data. It uses quantitative methods such as computational mathematics, machine learning, robotics, neural networks, and artificial intelligence to explore the data in the cloud.

Analyzing big data on cloud infrastructure makes sense because investments in big data analysis can be significant and drive a need for efficient, cost-effective infrastructure; because big data combines internal and external sources; and because data services are needed to extract value from big data [17].

To address the scalability problem for large-scale data sets, a widely adopted parallel data processing framework such as MapReduce is used. In the first phase, the original data set is partitioned into a group of smaller data sets, which are anonymized in parallel to produce intermediate results. In the second phase, the intermediate results are integrated into one and further anonymized to achieve a consistent k-anonymous data set.

MapReduce is a programming and implementation model for processing and generating large data sets. A map function processes a key-value pair and generates a set of intermediate key-value pairs; a reduce function merges all intermediate values associated with the same intermediate key.
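To make the model concrete, the following minimal Hadoop-style sketch shows a map function that emits a record's quasi-identifier combination as the intermediate key, and a reduce function that counts the records sharing that key — the group size a k-anonymity check needs. The class names and the choice of quasi-identifier columns are illustrative assumptions, not taken from the paper.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: parse one CSV record per line and emit its quasi-identifier
    // attributes (here, columns 0 and 1, e.g. age and zip code) as the
    // intermediate key, with a count of 1 as the intermediate value.
    public class QidGroupMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text qid = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            qid.set(fields[0] + "," + fields[1]); // quasi-identifier columns
            ctx.write(qid, ONE);
        }
    }

    // Reduce: merge all intermediate values that share the same key,
    // i.e. count how many records fall into each quasi-identifier group.
    class QidGroupReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text qid, Iterable<IntWritable> counts,
                              Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(qid, new IntWritable(sum)); // size of this group
        }
    }

A group whose size falls below k violates the k-anonymity requirement and must be generalized further before release.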
II. RELATED WORK

Ke Wang, Philip S. Yu, and Sourav Chakraborty adopt a bottom-up generalization approach that works iteratively to generalize the data. The generalized data remain useful for classification but become difficult to link to other sources. A hierarchical structure of generalizations specifies the generalization space, and identifying the best generalization with which to climb the hierarchy at each iteration is the key [2].

Benjamin C. M. Fung and Ke Wang discuss how privacy-preserving technology solves only part of the problem; it is equally important to identify and overcome the nontechnical difficulties faced by decision makers when deploying a privacy-preserving technology. Their concerns include the degradation of data quality, increased costs, increased complexity, and loss of valuable information. They see cross-disciplinary research as the key to removing these obstacles and urge scientists in the privacy protection field to collaborate with social scientists in sociology, psychology, and public policy studies [3].

Jiuyong Li, Jixue Liu, Muzammil Baig, and Raymond Chi-Wing Wong propose two classification-aware data anonymization methods that combine local value suppression with global attribute generalization. The attribute generalization is determined by the data distribution instead of the privacy requirement, and generalization levels are optimized by normalized mutual information to preserve classification capability [17].

Xiaokui Xiao and Yufei Tao present a technique called anatomy for publishing sensitive data sets. Anatomy releases all the quasi-identifier and sensitive values directly in two separate tables. Combined with a grouping mechanism, this approach protects privacy while capturing a large amount of correlation in the microdata. They develop a linear-time algorithm for computing anatomized tables that obey the l-diversity privacy requirement while minimizing the error of reconstructing the microdata [13].

III. PROBLEM ANALYSIS

Centralized Top-Down Specialization (TDS) approaches exploit data structures to improve scalability and efficiency by indexing anonymous data records, but overhead is incurred in maintaining the linkage structure and updating the statistics as data sets become large. Centralized approaches therefore suffer from low efficiency and poor scalability when handling large-scale data sets. A distributed TDS approach has been proposed to address anonymization in distributed systems, but it concentrates on privacy protection rather than scalability, and it employs information gain only, not privacy loss [1].

Indexing data structures speed up anonymization and generalization because they avoid repeatedly scanning the whole data set [15]. These approaches fail, however, in parallel or distributed environments such as cloud systems, since the indexing structures are centralized. Centralized approaches cannot handle large-scale data sets well on the cloud using a single VM, even if that VM has the highest computation and storage capability.

Fung et al. propose a TDS approach that produces an anonymized data set while addressing the exploration problem on the data. A data structure, Taxonomy Indexed PartitionS (TIPS), is exploited to improve the efficiency of TDS, but the approach is centralized, leading to inadequacy on large data sets.

Raj H, Nathuji R, Singh A, and England P propose cache-hierarchy-aware core assignment and page-coloring-based cache partitioning to provide resource isolation and better resource management, thereby guaranteeing the security of data during processing. However, the page-coloring approach degrades performance when a VM's working set does not fit in its cache partition [14].

Ke Wang and Philip S. Yu consider the following problem: a data holder needs to release a version of the data for building classification models while protecting sensitive information against external sources. We address this problem by adapting their iterative bottom-up generalization approach to generalize the data.

IV. METHODOLOGY

Two basic anonymization operators are used; a small code sketch at the end of this section illustrates both.

Suppression: certain values of the attributes are replaced by an asterisk (*). All or some values of a column may be replaced by *.

Generalization: individual values of attributes are replaced by a broader category. For example, the value 19 of the attribute Age may be replaced by "<20" and the value 23 by "20-29".

A. Bottom-Up Generalization

Bottom-Up Generalization (BUG) is one of the efficient k-anonymization methods. Under k-anonymity, attributes are suppressed or generalized until each row is identical to at least k-1 other rows, at which point the database is said to be k-anonymous. BUG starts from the lowest anonymization level and proceeds iteratively, leveraging the information/privacy trade-off as the search metric. Bottom-Up Generalization and a MapReduce Bottom-Up Generalization (MRBUG) driver are used. The Advanced BUG consists of the following steps: partition the data set, run the MRBUG driver on each partition, combine the anonymization levels of the partitioned data items, and apply the resulting generalization to the original data set without violating k-anonymity.
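The following self-contained sketch illustrates the suppression and generalization operators and the k-anonymity test described above. The attribute layout, interval boundaries, and the value of k are assumptions made for illustration only.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: suppression, generalization, and a
    // k-anonymity check over a small table of (age, zip) records.
    public class AnonymizationSketch {

        // Suppression: replace a value entirely with an asterisk.
        static String suppress(String value) {
            return "*";
        }

        // Generalization: replace an exact age with a broader
        // interval, e.g. 19 -> "<20", 23 -> "20-29".
        static String generalizeAge(int age) {
            if (age < 20) return "<20";
            int lo = (age / 10) * 10;
            return lo + "-" + (lo + 9);
        }

        // k-anonymity check: every quasi-identifier combination must
        // occur in at least k rows of the released table.
        static boolean isKAnonymous(String[][] table, int k) {
            Map<String, Integer> groupSizes = new HashMap<>();
            for (String[] row : table) {
                groupSizes.merge(String.join(",", row), 1, Integer::sum);
            }
            for (int size : groupSizes.values())
                if (size < k) return false;
            return true;
        }

        public static void main(String[] args) {
            int[] ages = {19, 23, 27, 18};
            String[] zips = {"54321", "54322", "54321", "54399"};
            String[][] released = new String[ages.length][2];
            for (int i = 0; i < ages.length; i++) {
                released[i][0] = generalizeAge(ages[i]); // generalized age
                released[i][1] = suppress(zips[i]);      // suppressed zip
            }
            System.out.println("2-anonymous? " + isKAnonymous(released, 2));
        }
    }

Here the generalized ages form two groups of size two each, so the released table satisfies 2-anonymity; with k = 3 the check would fail and another round of generalization would be required.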
Fig. 1. System architecture of the bottom-up approach.

The Advanced Bottom-Up Generalization approach improves the scalability and performance of BUG through two levels of parallelization provided by MapReduce (MR) on the cloud. The first is job-level parallelization, meaning that multiple MR jobs can execute simultaneously to make full use of the cloud infrastructure. The second is task-level parallelization, meaning that multiple mapper and reducer tasks within an MR job execute simultaneously over data partitions. Our approach performs the following steps: first, the data set is split into smaller data sets using job-level MapReduce; the partitioned data sets are then anonymized by the Bottom-Up Generalization driver; the resulting intermediate anonymization levels are integrated into one, ensuring that the integrated intermediate level never violates the k-anonymity property; finally, the driver is executed on the original data set with the merged intermediate anonymization level to produce the resultant anonymization level. The algorithm for Advanced Bottom-Up Generalization [15] is given below; in the ith iteration, it generalizes R by the best generalization Gbest, where IP(G) denotes the search metric of a candidate generalization G.

    1: while R does not satisfy the anonymity requirement do
    2:   for all generalizations G do
    3:     compute IP(G);
    4:   end for
    5:   find the best generalization Gbest;
    6:   generalize R through Gbest;
    7: end while
    8: output R;

B. MapReduce

The MapReduce framework is built on two functions: map and reduce. Map is a function that parcels out work to the different nodes in the distributed cluster; reduce is a function that collates the work and resolves the results into a single value.

Fig. 2. MapReduce framework.

The MR framework is fault-tolerant because each node in the cluster is expected to report back periodically with status updates and completed work. If a node remains silent for longer than expected, a master node notes this and reassigns its task to other nodes. A single MR job is inadequate to accomplish the whole task, so a group of MR jobs is orchestrated in one MR driver. The framework consists of the MR driver and two types of jobs: IGPL Initialization and IGPL Update. The MR driver arranges the execution of the jobs. Hadoop provides a mechanism for setting global variables for the mappers and reducers; the best specialization is passed into the map function of the IGPL Update job this way. In the bottom-up approach, the data is first initialized to its current state, and generalizations are then carried out so long as k-anonymity is not violated; that is, we climb the taxonomy tree of each attribute until the required anonymity is achieved.
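The global-variable mechanism can be made concrete with a small sketch: in Hadoop, a value such as the currently selected best generalization can be placed in the job Configuration by the driver and read back inside each mapper. The job name, property name, and mapper logic below are illustrative assumptions, not the paper's actual IGPL implementation.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    // Driver side: publish the best generalization chosen in the
    // previous iteration as a job-wide value via the Configuration.
    public class IgplUpdateDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical property name and value for illustration.
            conf.set("bug.best.generalization", "Age:20-29");
            Job job = Job.getInstance(conf, "IGPL Update (illustrative)");
            job.setJarByClass(IgplUpdateDriver.class);
            job.setMapperClass(UpdateMapper.class);
            // ... input/output paths and formats would be set here ...
            job.waitForCompletion(true);
        }
    }

    // Mapper side: read the driver-supplied value back in setup() and
    // use it for every record of the partition.
    class UpdateMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private String bestGeneralization;

        @Override
        protected void setup(Context ctx) {
            bestGeneralization =
                ctx.getConfiguration().get("bug.best.generalization");
        }

        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            // A real implementation would rewrite the record's attribute
            // according to bestGeneralization before emitting it; this
            // sketch passes the record through unchanged.
            ctx.write(new Text(record.toString()), NullWritable.get());
        }
    }

Because the Configuration is distributed to every task, each mapper of the update job sees the same generalization without any shared mutable state between nodes.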
V. EXPERIMENT EVALUATION

The goal is to use data generalization from data mining to hide detailed information rather than to discover patterns and trends. Once the data has been masked, all standard data mining techniques can be applied without modification; the technique therefore not only supports the discovery of useful patterns but also masks private information.

Fig. 3. Change of execution time of TDS and BUG.

Fig. 3 shows the change in execution time of the TDS and BUG algorithms. We compared the execution time of TDS and BUG for EHR data sets ranging from 50 to 500 MB, keeping p = 1, with bottom-up generalization transforming specific data into less specific data. The experiments focus on two key issues, quality and scalability: quality is addressed by trading off information and privacy in the bottom-up generalization approach, while scalability is addressed by a novel data structure that focuses the generalizations. To evaluate the efficiency and effectiveness of the BUG approach, we compare BUG with TDS. The experiments were performed in a cloud environment, with both approaches implemented in Java on the standard Hadoop MapReduce API.

VI. CONCLUSION

We studied the scalability problem of anonymizing data on the cloud for big data applications and proposed a scalable Bottom-Up Generalization approach. The BUG approach proceeds as follows: first the data is partitioned and the driver is executed to produce intermediate results; these results are then merged into one, and generalization is applied to produce the anonymized data. The anonymization is performed with the MapReduce framework on the cloud. The results show that scalability and efficiency are improved significantly over existing approaches.

REFERENCES

[1] X. Zhang, L.T. Yang, C. Liu, and J. Chen, "A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud," IEEE Trans. Parallel and Distributed Systems, vol. 25, no. 2, Feb. 2014.
[2] K. Wang, P.S. Yu, and S. Chakraborty, "Bottom-Up Generalization: A Data Mining Solution to Privacy Protection," Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM '04), 2004.
[3] B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys, vol. 42, no. 4, pp. 1-53, 2010.
[4] K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, "Workload-Aware Anonymization Techniques for Large-Scale Datasets," ACM Trans. Database Systems, vol. 33, no. 3, pp. 1-47, 2008.
[5] B.C.M. Fung, K. Wang, L. Wang, and P.C.K. Hung, "Privacy-Preserving Data Publishing for Cluster Analysis," Data & Knowledge Engineering, vol. 68, no. 6, pp. 552-575, 2009.
[6] B.C.M. Fung, K. Wang, and P.S. Yu, "Anonymizing Classification Data for Privacy Preservation," IEEE Trans. Knowledge and Data Engineering, vol. 19, no. 5, pp. 711-725, May 2007.
[7] H. Takabi, J.B.D. Joshi, and G.-J. Ahn, "Security and Privacy Challenges in Cloud Computing Environments," IEEE Security & Privacy, vol. 8, no. 6, 2010.
[8] K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, "Incognito: Efficient Full-Domain K-Anonymity," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), pp. 49-60, 2005.
[9] T. Iwuchukwu and J.F. Naughton, "K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 746-757, 2007.
[10] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[11] J. Dean and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," Comm. ACM, vol. 53, no. 1, pp. 72-77, 2010, doi:10.1145/1629175.1629198.
[12] J. Li, J. Liu, M. Baig, and R.C.-W. Wong, "Information Based Data Anonymization for Classification Utility."
[13] X. Xiao and Y. Tao, "Anatomy: Simple and Effective Privacy Preservation," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB '06), pp. 139-150, 2006.
[14] H. Raj, R. Nathuji, A. Singh, and P. England, "Resource Management for Isolation Enhanced Cloud Services," Proc. 2009 ACM Workshop on Cloud Computing Security, Chicago, Illinois, USA, pp. 77-84, 2009.
â€Å"An Advanced Bottom up  Generalization Approach for Big Data on Cloud† , Volume: 03, June  2014, Pages: 1054-1059.. [16] Intel â€Å"Big Data in the Cloud: Converging Technologies†. [17] Jiuyong Li, Jixue Liu Muzammil Baig, Raymond Chi-Wing Wong,  Ã¢â‚¬Å"Information based data anonymization for classification utility†.
