
Preventing data leakage detection by automaton segmentation

Abstract: A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or on somebody's laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases we can also inject "realistic but fake" data records to further improve our chances of detecting leakage and identifying the guilty party.

I. INTRODUCTION

In the course of business, sensitive data must sometimes be handed over to supposedly trusted third parties. For instance, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so data must be given to various other companies. We call the owner of the data the distributor and the supposedly trusted third parties the agents. Our goal is to detect when the distributor's sensitive data has been leaked by agents, and, if possible, to identify the agent that leaked the data.

Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later found in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but they involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious.

In this paper we study unobtrusive techniques for detecting leakage of a set of objects or records. Specifically, we study the following scenario: after giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. (For instance, the data may be found on a web site, or may be obtained through a legal process.) At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered. Using an analogy with cookies stolen from a cookie jar, if we catch Jack with a single cookie, he can argue that a colleague gave him the cookie.

In this paper we develop a model for assessing the "guilt" of agents. We also present algorithms for distributing objects to agents in a way that improves our chances of identifying a leaker. Finally, we also consider the option of adding "fake" objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the whole set, without modifying any individual members. If it turns out an agent was given one or more fake objects that were leaked, then the distributor can be more confident that that agent was guilty.
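To make the idea concrete, here is a minimal Python sketch (an illustration under assumed record formats, not the paper's allocation algorithm or guilt model): each agent receives the real records plus a few unique fake records, and an agent whose fakes appear in a leaked set becomes the prime suspect.

import secrets

def allocate(agents, real_records, fakes_per_agent=2):
    # Give every agent the real records plus a few unique, realistic-looking fakes.
    allocation, fake_owner = {}, {}
    for agent in agents:
        fakes = {"fake-%s-%s" % (agent, secrets.token_hex(4))
                 for _ in range(fakes_per_agent)}
        for f in fakes:
            fake_owner[f] = agent
        allocation[agent] = set(real_records) | fakes
    return allocation, fake_owner

def suspects(leaked_records, fake_owner):
    # Agents whose fake records show up in the leaked set are the prime suspects.
    return {fake_owner[r] for r in leaked_records if r in fake_owner}

agents = ["A1", "A2"]
allocation, fake_owner = allocate(agents, ["rec-001", "rec-002"])
leaked = allocation["A2"]                     # pretend agent A2's copy leaked
print(suspects(leaked, fake_owner))           # -> {'A2'}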

The participants in a dataspace are the individual data sources and the software packages that manage them. They can be stored or streamed (managed locally by data stream systems), or even sensor deployments. Some participants may support expressive query languages, while others are opaque and offer only limited interfaces for posing queries (e.g., structured files, web services, or other software packages).

Participants vary from being fully structured (e.g., relational databases) to semi-structured (XML, code collections) to completely unstructured. Some sources will support traditional updates, while others may be append-only (for archiving purposes). A dataspace should be able to model any kind of relationship between two (or more) participants. On the more traditional end, we should be able to state that one participant is a view or a replica of another, or to specify a schema mapping between two participants. We would, however, like to model a much broader set of relationships, such as that source A was manually derived from sources B and C, or that two sources were created independently but reflect the same physical system (e.g., mouse DNA). Relationships may be even more specific, such as that two datasets came from the same source at the same time.

II. THE PROBLEM

A. Vulnerabilities and the Threat Model

In a typical information brokering scenario, there are three types of stakeholders, namely data owners, data providers, and data requestors. Each stakeholder has its own privacy concerns: (1) the privacy of a data owner (e.g., a patient in a RHIO) lies in the identifiable and sensitive personal information carried in the data (e.g., medical records). Data owners usually sign strict privacy agreements with data providers to prevent unauthorized use or disclosure. (2) Data providers store the collected data and create two types of metadata for it, namely routing metadata and access control metadata. Both types of metadata are considered private to a data provider. (3) A data requestor may reveal identifiable or private information (e.g., information specifying her interests) in the query content. For example, a query about AIDS treatment reveals the (possible) disease of the requestor.

We adopt the semi-honest [2] assumption for the brokers, and assume two types of adversaries: external attackers and curious or corrupted brokering components. External attackers passively eavesdrop on communication channels. Curious or corrupted brokering components, while following the protocols properly to fulfill brokering functions, try their best to infer sensitive or private information from the querying process. Privacy concerns arise when identifiable information is disseminated with no or poor disclosure control. For example, when a data provider pushes routing and access control metadata to the local broker [6], a curious or corrupted broker learns query content and query location by intercepting a local query, learns routing metadata and access control metadata of local data servers and of other brokers, and learns data location from the routing metadata it holds. Existing security mechanisms focusing on confidentiality and integrity cannot preserve privacy effectively. For instance, while data is protected over encrypted communication, external attackers still learn query location and data location from eavesdropping. By combining different types of unintentionally disclosed information, an attacker could further compromise the privacy of different stakeholders through attribute-correlation attacks and inference attacks.

III. BACKGROUND

Related Work: Research areas such as information integration, peer-to-peer file sharing, and publish-subscribe systems provide partial solutions to the problem of large-scale data sharing. Information integration approaches focus on providing an integrated view over a large number of heterogeneous data sources by exploiting the semantic relationships between the schemas of different sources [5]. The PPIB study assumes that a global schema exists within the consortium; therefore, information integration is out of its scope. Peer-to-peer systems are designed to share files and datasets (e.g., in collaborative science applications). Distributed hash table technology [16] is adopted to locate replicas based on keywords. However, although such technology has been extended to support range queries [8], the coarse granularity (e.g., files and documents) cannot meet the expressiveness needs of the applications considered in this work. Furthermore, P2P systems often return an incomplete set of answers, while we need to locate all relevant data in the IBS. Addressing a conceptually dual problem, XML publish-subscribe systems (e.g., [9]) are probably the most closely related technology to the proposed research problem: while PPIB aims to locate relevant data sources for a given query and route the query to those data sources, pub/sub systems locate relevant consumers of a document and route the document to those consumers. Due to this duality, we have different concerns. Pub/sub systems focus more on efficiently delivering the same piece of information to a large number of consumers, while we try to route a large volume of small-sized queries. Accordingly, the multicast solution in pub/sub systems does not scale in our environment, and we need to develop new mechanisms.

One idea is to build an XML overlay architecture that supports expressive query processing and security checking atop a normal IP network. In particular, specialized data structures are maintained on overlay nodes to route XML queries. In [5], a robust mesh has been built to effectively route XML packets by making use of self-describing XML tags and overlay networks. Koudas et al. also proposed a decentralized architecture for ad hoc XPath query routing across a collection of XML databases [6]. To share data among a large number of autonomous nodes, [2] studied content-based routing for path queries in peer-to-peer systems. Different from these approaches, PPIB seamlessly integrates query routing with security and privacy protection. Privacy concerns arise in interorganizational information brokering since one can no longer assume brokers controlled by other organizations are fully trustworthy. As the major source of potential privacy leakage is the metadata (i.e., indexing and access control rules), secure index-based search schemes [2] may be adopted to outsource the metadata in encrypted form to untrusted brokers. Brokers are assumed to enforce security checks and make routing decisions without knowing the content of either the query or the metadata rules. Various protocols have been proposed for searchable encryption; however, to the best of our knowledge, all the schemes presented so far only support keyword search based on exact matching.

A. Automaton Segmentation

In the context of distributed information brokering, multiple organizations join a consortium and agree to share their data within the consortium. While different organizations may have different schemas, we assume a global schema exists, obtained by aligning and merging the local schemas. Thus, the access control rules and index rules for all the organizations can be crafted following the same shared schema and captured by a global automaton. The key idea of the automaton segmentation scheme is to logically divide the global automaton into multiple independent yet connected segments, and to physically distribute the segments onto different brokering components, known as coordinators.
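As a rough illustration (the class, rule, and token names below are assumptions, not the PPIB implementation), the global automaton can be thought of as a trie-like NFA built from XPath-style rules that follow the shared schema; the segmentation scheme then splits its states across coordinators.

class NFAState:
    def __init__(self, token=None):
        self.token = token        # the XPath step this state matches, e.g. "patient"
        self.children = []        # next NFA states in the global automaton
        self.is_accept = False    # True if a query may be accepted at this state

def build_global_automaton(rules):
    # Merge a list of XPath-like rules (lists of steps) into one shared automaton.
    root = NFAState()
    for rule in rules:
        node = root
        for token in rule:
            match = next((c for c in node.children if c.token == token), None)
            if match is None:
                match = NFAState(token)
                node.children.append(match)
            node = match
        node.is_accept = True
    return root

# Two rules sharing the prefix /records/patient produce one automaton with a branch;
# segmentation would then divide its states into segments held by coordinators.
root = build_global_automaton([["records", "patient", "diagnosis"],
                               ["records", "patient", "billing"]])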

1) Segmentation: The atomic unit in the segmentation is an NFA state of the original automaton. Each segment is allowed to hold one or several NFA states. We further define the granularity level to denote the greatest distance between any two NFA states contained in one segment. Given a granularity level, each segmentation step groups the next states (up to that distance) into one segment with a certain probability. Obviously, with a larger granularity level each segment will contain more NFA states, resulting in fewer segments and smaller end-to-end overhead in distributed query processing. However, a coarse partition is more likely to increase the privacy risk. The trade-off between processing complexity and the degree of privacy should be considered when deciding the granularity level. As privacy protection is the primary concern of this work, we suggest a small granularity level. To preserve the logical connections between the segments after segmentation, we define the following heuristic segmentation rules: (1) NFA states in the same segment should be connected via parent-child links; (2) sibling NFA states should not be put in the same segment without their parent state; and (3) the accept states of the original global automaton should be put in separate segments. To ensure that the segments remain logically connected, we also make the last state of each segment a "dummy" accept state, with links pointing to the segments holding the child states of the original global automaton.

Algorithm 1 The automaton segmentation algorithm: deploySegment

Input: Automaton state S
Output: Segment address

1: for each symbol k in S.StateTransTable do
2:     addr = deploySegment(S.StateTransTable(k).nextState)
3:     DS = createDummyAcceptState()
4:     DS.nextState = addr
5: end for
6: Seg = createSegment()
7: Seg.addState(S)
8: return Seg.address
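A hedged Python sketch of this recursive deployment at granularity level 1 (one NFA state per segment) follows; the State and Segment structures and the address allocator are illustrative assumptions, not the paper's implementation.

class State:
    def __init__(self):
        self.trans = {}           # symbol -> next State (mirrors S.StateTransTable)

class Segment:
    def __init__(self, address):
        self.address = address    # stands in for a (coordinator, port) pair
        self.states = []          # NFA states held by this segment
        self.dummy_links = {}     # symbol -> address of the downstream segment

_counter = {"n": 0}
def new_address():
    # Assumed helper: allocate the next coordinator address.
    _counter["n"] += 1
    return "coordinator-%d" % _counter["n"]

def deploy_segment(state):
    # Deploy the automaton rooted at `state`; return the address of its segment.
    seg = Segment(new_address())
    for symbol, next_state in state.trans.items():
        child_addr = deploy_segment(next_state)   # deploy the child subtree first
        seg.dummy_links[symbol] = child_addr      # dummy accept link to the child segment
    seg.states.append(state)
    return seg.address

# Example: /records/patient as three chained states, deployed bottom-up.
root, s1, s2 = State(), State(), State()
root.trans["records"] = s1
s1.trans["patient"] = s2
print(deploy_segment(root))   # -> "coordinator-3" (the root segment is deployed last)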

We employ physical brokering servers, called coordinators, to store the logical segments. To reduce the number of coordinators needed, several segments can be deployed on the same coordinator using different port numbers. Therefore, the tuple (coordinator, port number) uniquely identifies a segment. For ease of presentation, we assume each coordinator holds only one segment in the rest of the article. After the deployment, the coordinators can be linked together according to the relative positions of the segments they store, and thus form a tree structure. The coordinator holding the root state of the global automaton is the root of the coordinator tree, and the coordinators holding the accept states are the leaf nodes. Queries are processed along the paths of the coordinator tree in a similar way as they are processed by the global automaton: starting from the root coordinator, the first XPath step (token) of the query is compared with the tokens stored in the root coordinator. If it matches, the query is sent to the next coordinator, and so forth, until it is accepted by a leaf coordinator and forwarded to the data server specified by the outgoing link of that leaf coordinator. At any coordinator, if the input XPath step does not match the stored tokens, the query is denied and dropped immediately.
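The following is a minimal Python sketch of this matching-and-forwarding behavior (the coordinator structure, token tables, and addresses are assumptions for illustration, not the PPIB implementation).

class Coordinator:
    def __init__(self, tokens):
        self.tokens = tokens      # step -> next Coordinator, or a data-server address

def route_query(root, xpath_steps):
    # Walk the coordinator tree step by step; deny and drop on the first mismatch.
    node = root
    for step in xpath_steps:
        nxt = node.tokens.get(step)
        if nxt is None:
            return None                      # query denied and dropped immediately
        if isinstance(nxt, str):
            return nxt                       # accepted: forward to this data server
        node = nxt
    return None                              # ran out of steps before an accept state

# /records/patient is accepted and forwarded; /records/billing is dropped.
leaf = Coordinator({"patient": "data-server-7"})
root = Coordinator({"records": leaf})
print(route_query(root, ["records", "patient"]))   # -> data-server-7
print(route_query(root, ["records", "billing"]))   # -> None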

Since all queries are supposed to be processed first by the root coordinator, it becomes a single point of failure and a performance bottleneck. For robustness, we need to replicate the root coordinator as well as the coordinators at the higher levels of the coordinator tree. Replication has been extensively studied in distributed systems. We adopt the passive path replication strategy to create replicas for the coordinators along the paths of the coordinator tree, and let the centralized authority (CA) create or revoke the replicas (see Section V for details). The CA maintains a set of replicas for each coordinator, where the number of replicas is either a preset value or dynamically adjusted based on the average number of queries passing through that coordinator.
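As a rough illustration of the dynamic option (the threshold of 100 queries per second per replica is an assumed value, not one from the paper), the CA could size each coordinator's replica set in proportion to its observed query rate.

import math

def replicas_needed(avg_queries_per_sec, queries_per_replica=100.0, minimum=1):
    # Size the replica set in proportion to observed load, with a preset floor.
    return max(minimum, math.ceil(avg_queries_per_sec / queries_per_replica))

# The heavily loaded root coordinator gets more replicas than a quiet leaf.
print(replicas_needed(450.0))   # -> 5
print(replicas_needed(3.0))     # -> 1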

CONCLUSION

With little attention paid to the privacy of users, data, and metadata during the design stage, existing information brokering systems suffer from a spectrum of vulnerabilities associated with user privacy, data privacy, and metadata privacy. In this paper, we propose PPIB, a new approach to preserving privacy in XML information brokering. Through an innovative automaton segmentation scheme, in-network access control, and query segment encryption, PPIB integrates security enforcement and query forwarding while providing comprehensive privacy protection. Our analysis shows that it is highly resistant to privacy attacks. End-to-end query processing performance and system scalability are also evaluated, and the results show that PPIB is efficient and scalable. Many directions remain for future research. First, at present, site distribution and load balancing in PPIB are conducted in an ad hoc manner. Our next step is to design an automatic scheme that performs dynamic site distribution.

Several factors can be considered in such a scheme, such as the workload at each peer, the trust level of each peer, and privacy conflicts between automaton segments. Designing a scheme that strikes a balance among these factors is a challenge. Second, we would like to quantify the level of privacy protection achieved by PPIB. Finally, we plan to minimize (or even eliminate) the participation of the administrator node, which decides such issues as automaton segmentation granularity, to avoid data leakage.

REFERENCES

[1] R. Agrawal and J. Kiernan. Watermarking relational databases. In VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, pages 155–166. VLDB Endowment, 2002.
[2] P. Bonatti, S. D. C. di Vimercati, and P. Samarati. An algebra for composing access control policies. ACM Trans. Inf. Syst. Secur., 5(1):1–35, 2002.
[3] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In J. V. den Bussche and V. Vianu, editors, Database Theory - ICDT 2001, 8th International Conference, London, UK, January 4-6, 2001, Proceedings, volume 1973 of Lecture Notes in Computer Science, pages 316–330. Springer, 2001.
[4] P. Buneman and W.-C. Tan. Provenance in databases. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1171–1173, New York, NY, USA, 2007. ACM.
[5] Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. In The VLDB Journal, pages 471–480, 2001.
[6] S. Czerwinski, R. Fromm, and T. Hodes. Digital music distribution and audio watermarking.
[7] F. Guo, J. Wang, Z. Zhang, X. Ye, and D. Li. An improved algorithm to watermark numeric relational data. In Information Security Applications, pages 138–149. Springer, Berlin / Heidelberg, 2006.
[8] F. Hartung and B. Girod. Watermarking of uncompressed and compressed video. Signal Processing, 66(3):283–301, 1998.
[9] S. Jajodia, P. Samarati, M. L. Sapino, and V. S. Subrahmanian. Flexible support for multiple access control policies. ACM Trans. Database Syst., 26(2):214–260, 2001.
[10] Y. Li, V. Swarup, and S. Jajodia. Fingerprinting relational databases: Schemes and specialties. IEEE Transactions on Dependable and Secure Computing, 2(1):34–45, 2005.
[11] B. Mungamuru and H. Garcia-Molina. Privacy, preservation and performance: The 3 P's of distributed data management. Technical report, Stanford University.
