
A Proposed Solution: Identifying Sensitive Information as a Safety Measure Against Privacy Vulnerabilities Associated With Optical Character Recognition

Farah Arafa Mohammed

Abstract 

Optical character recognition (OCR) is a technology used to generate machine-readable text from images and documents; some OCR applications store the extracted text in cloud storage, which has proven to be less than fully secure for sensitive information. Items that include sensitive information should therefore not be processed, nor have their text extracted and stored, if the user's security is to be preserved; this is not possible unless sensitive data is identified first. Based on research into this problem, previous efforts, and what is currently available, this paper proposes a solution: identify items that include sensitive information and prevent OCR applications that store extracted text in cloud storage from extracting text from those items. This research also tests the validity of the major part of the proposed solution, which is identifying items that include sensitive data in the first place. To test this ability, a MobileNet neural network was trained four times to determine whether items include sensitive data. The results of testing MobileNet after the last training session demonstrated that sensitive information can be identified at a reliable level of accuracy in a short time, indicating promising results for the proposed solution if applied to a real OCR application with simple coding, including if-else statements.


1. Introduction: 

Optical character recognition (OCR) is a technology used to convert images or scanned documents containing text into machine-readable text [1]. OCR applications use approaches such as rule-based methods and pattern-matching algorithms to identify text and make it machine-readable [2]. The input data is transmitted through an API from the customer's location, where the OCR system is used, to the vendor's location in the cloud, where the data extraction takes place; input and output data can then be stored locally at the vendor's location in the cloud [2]. The most common storage options are as follows. The data may be held in a buffer only during the execution of the data extraction process, with the vendor not saving the data after sending the output to the customer [2]. The data may also be temporarily stored in a cache, where the retention period depends on the cache memory capacity and configuration. Another possible scenario is storage in a persistent storage layer such as a database or cloud storage [2]. The longer the data is stored in a system, the higher the risk. The last option (a persistent storage location such as a file or a database) carries the highest risk because the data can remain stored for a long duration [2]. Some OCR applications, Microsoft Lens for example, create a file on a cloud storage service to store the extracted text, which may lead to unauthorized access to confidential and sensitive information, unlawful repurposing, or an infringement of the data storage limitation principle.


Fig (1) Data flow in OCR systems hosted in the cloud [2]
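To make the data flow above concrete, the following is a minimal sketch of how an image might leave the client in a cloud-hosted OCR system. The endpoint URL, request format, and response field are hypothetical and only illustrate the pattern; real vendors' APIs and retention policies differ.

```python
# Minimal sketch of how an image might leave the client in a cloud-hosted OCR
# system. The endpoint URL and response field are hypothetical; real vendors
# (and their buffer/cache/persistent-storage retention policies) differ.
import requests

OCR_ENDPOINT = "https://ocr.example-vendor.com/v1/extract"  # hypothetical URL


def extract_text(image_path: str, api_key: str) -> str:
    """Send an image to a (hypothetical) cloud OCR API and return the extracted text."""
    with open(image_path, "rb") as f:
        response = requests.post(
            OCR_ENDPOINT,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"image": f},
            timeout=30,
        )
    response.raise_for_status()
    # Once the image and the extracted text reach the vendor, how long they are
    # retained (buffer, cache, or persistent storage) is outside the client's control.
    return response.json()["text"]  # hypothetical response field
```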


Cloud Service Providers (CSPs) have promised to ensure the security of cloud clients' stored data by using methods such as firewalls and virtualization [3]. These mechanisms do not provide complete data protection because of their vulnerabilities over the network, and because CSPs have full command of cloud applications, hardware, and client data [3]. Encrypting sensitive data before hosting can preserve data privacy and confidentiality against the CSP; however, encryption schemes are usually impractical because of the huge communication overhead they impose on cloud access patterns [3].


Cloud storage vulnerabilities can be classified into three categories: (i) network/resource-based [5], which includes Denial of Service, Buffer Overflow, Virtual Machine Based Rootkit, Side Channel, Man in the Middle, Replay Attack, and Byzantine Failure; (ii) browser-based, which includes Cross Site Scripting, Malware, and XML Wrapping; and (iii) social networking-based, which includes Sybil Attack, Social Intersection Attacks, and Collusion Attack [4]. All of these vulnerabilities exist in cloud storage, and the majority of their mitigations (where they exist) are either reliable only in specific scenarios, applicable to a single scenario (unreliable), or can be defeated [4], which makes cloud storage unsafe for OCR applications to store sensitive information in.


The visibility and public awareness of safe information and transaction sharing remain limited; the urgency and behavior of users do not reflect a positive reaction to awareness-related efforts [6]. Therefore, the solution proposed in this research takes the form of a technology tool.


Research goal 

Based on the conducted research, OCR applications that store extracted text on cloud storage should not process images that include sensitive information if a reliable level of security is to be achieved, and for the applications not to process such images, these images must be identified first. This research aims to propose a solution to the potential violations associated with OCR applications and to thoroughly test its major part, which is identifying sensitive data. The proposed solution is to classify the images with a neural network [7] that could be integrated with OCR applications so that the neural network's page is the first page the user is obligatorily navigated to.


In this solution, the neural network's main task is to accurately differentiate between images that include sensitive information and those that include non-sensitive information, and to be integrated with OCR applications so that the OCR is automatically prevented from processing sensitive items and extracting their text.


2. Background: 

2.1. Optical character recognition (OCR) 

Optical character recognition is a field with many different applications, including financial, legal, healthcare, and invoice imaging [8].


2.2. Network Vulnerability  

A network vulnerability is a security exposure that results from a flaw in design or implementation. A network vulnerability has the potential to trigger an unanticipated and undesirable action that compromises the security of a network infrastructure. Thus, even in cases where deployment and implementation are done correctly, a network vulnerability can make it impossible to stop an unauthorized user from gaining access to a network, altering it, and compromising data on it. Often, and particularly where the vulnerability is related to software, the vendor is expected to release patches to address any found vulnerabilities [5].


2.3. Browser-Based Vulnerability 

A web browser-based attack uses a web browser's flaws to steal user information or run malicious code. By exploiting flaws in web applications, these attacks may compromise user data, steal confidential information, or interfere with services. The way a website operates can be used by attackers to target its end users. Through pop-ups and clickjacking, attackers inject websites with malicious code [9], and end-user systems then forward sensitive data to the attackers.


Insecure habits, malevolent activities, and technical weaknesses all contribute to the  occurrence of browser-based attacks. 


2.4. Social Networking-Based Vulnerability 

Numerous security and privacy issues are related to the user's shared information, particularly when the user uploads private files [10]. Shared information may be maliciously used by an attacker for illegal purposes, and targeting vulnerable sectors of the community increases the hazards [10]. Privacy violations through social media may take many forms, including contaminating or accessing a computer system, acquiring sensitive and confidential information such as a user's username, password, and credit card details through fake websites and emails, stealing identity, and so on [10].


2.5. Image classification 

A key aspect of computer vision is image classification, which consists of classifying images into one of several predetermined labels. It serves as the foundation for advanced computer vision tasks like segmentation, detection, and localization. Originally, handcrafted features were obtained from images by feature descriptors and used as input to a trainable classifier [11]. The main problem with this strategy was that the design of the feature extraction stage had a significant impact on the accuracy of the classification task, and this design was often a demanding task in itself. These difficulties have been addressed in recent years by deep learning models that take advantage of numerous layers of nonlinear information processing for feature extraction and transformation, pattern analysis, and classification [11].


2.6. Neural Networks 

Neural networks are a type of artificial intelligence that attempts to imitate the way a human brain works. Neural networks work by creating connections between processing elements, the computer equivalent of neurons; the organization and weights of the connections determine the output [12]. Neural network models are ubiquitous in the image data space. They work well on computer vision tasks like image classification, object detection, and image recognition [13]. They have hence been widely used in artificial intelligence modeling.


2.6.1. Convolutional Layers 

A convolutional neural network's primary component is the convolutional layer. It includes several filters, whose parameters are to be "learned" during training. Each filter is smaller than the original image. The convolutional layers are used as feature extractors; thus, they learn the feature representations of their input images [11]. To create a new feature map [14], inputs are convolved with the learned weights, and the convolved outputs are then passed through a nonlinear activation function.
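As a minimal sketch of this mechanism (using TensorFlow/Keras, which is an assumption; the paper does not name a framework), the layer below convolves its learned 3×3 filters with an input image and passes the result through a nonlinear activation, producing a stack of feature maps.

```python
# Minimal sketch of a convolutional layer: learned filters are convolved with
# the input and the result is passed through a nonlinear activation (ReLU).
# TensorFlow/Keras is an assumption; the paper does not specify a framework.
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 3))           # an RGB input image
feature_maps = tf.keras.layers.Conv2D(
    filters=32,            # 32 filters, each smaller than the input (3x3)
    kernel_size=3,
    padding="same",
    activation="relu",     # nonlinear activation applied to the convolved output
)(inputs)                  # output: a stack of 32 feature maps, 224x224 each

model = tf.keras.Model(inputs, feature_maps)
model.summary()            # shows the (None, 224, 224, 32) feature-map shape
```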


2.7. MobileNet 

The foundation of MobileNets is a streamlined architecture that creates lightweight deep neural networks using depth-wise separable convolutions [15]. A standard convolution filters features based on its convolutional kernels and combines them to produce a new representation in a single step, whereas depth-wise separable convolutions split the filtering and combination into two steps, which results in a tangibly reduced computational cost for the MobileNet neural network [15]. MobileNets are well suited to interactive settings in which the efficiency of computer vision matters most [16]. They are also known to achieve high accuracy with substantially fewer parameters than standard models. Due to their high efficiency, MobileNets are used in solving a variety of problems, including face attribute classification, fine-grained recognition, large-scale geolocalization, object detection, face embedding, and more [15].
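The parameter savings of this factorization can be seen in a short sketch. The comparison below (again assuming TensorFlow/Keras; the layer sizes are illustrative) builds the same 64-channel-to-128-channel transformation once as a standard 3×3 convolution and once as a depthwise 3×3 filter followed by a 1×1 pointwise combination, in the spirit of [15].

```python
# Sketch comparing a standard convolution with the depthwise separable
# factorization MobileNet uses (depthwise filtering + 1x1 pointwise combination).
# TensorFlow/Keras and the layer sizes are assumptions for illustration.
import tensorflow as tf

inputs = tf.keras.Input(shape=(112, 112, 64))

# Standard convolution: filters and combines channels in one step.
standard = tf.keras.layers.Conv2D(128, kernel_size=3, padding="same")(inputs)

# Depthwise separable convolution: one 3x3 filter per input channel,
# then a 1x1 pointwise convolution that linearly combines the channels.
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)
pointwise = tf.keras.layers.Conv2D(128, kernel_size=1, padding="same")(depthwise)

standard_model = tf.keras.Model(inputs, standard)
separable_model = tf.keras.Model(inputs, pointwise)

print("standard conv parameters: ", standard_model.count_params())   # ~73,856
print("separable conv parameters:", separable_model.count_params())  # ~8,960
```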


2.8. Feature map 

A feature map is a set of neurons that are connected to an input and work to generate  an output that encodes features or patterns present in the input. This allows the neurons  to extract particular features from the input [14]. Multiple feature maps are usually  stacked together to create a hierarchy of features [14]. 


3. Methodology 

In order to differentiate between images including sensitive information and images including non-sensitive information at a high accuracy, at a low computational cost, and with a relatively small amount of data [15], [16], the MobileNet neural network (which falls under the supervised learning category [17]) was used. The MobileNet model is based on depthwise separable convolutions [18], which factorize a standard convolution into a depthwise convolution [19] and a 1×1 convolution called a pointwise convolution [20], [15]. A standard convolution both filters and combines the inputs into a new set of outputs in one step, whereas the depthwise separable convolution splits this into two layers [15].


Figure (2) shows how a standard convolution is factorized into a depthwise convolution (3) and a 1 × 1 pointwise convolution (4). 



Depthwise separable convolutions are included in MobileNet for their substantial reduction in computational cost; they apply a single filter [21] to each input channel (input depth [22]). Pointwise convolution, a simple 1×1 convolution, then creates a linear combination of the outputs of the depthwise layer [15]. Depthwise convolution with one filter per input depth is computed as shown in the following equation:



$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m}$$

where $\hat{K}$ is the depthwise convolutional kernel [23] of size $D_K \times D_K \times M$, in which the $m$th filter in $\hat{K}$ is applied to the $m$th channel of the input feature map $F$ to produce the $m$th channel of the filtered output feature map $\hat{G}$ [15].

In the training process of MobileNet, features of the raw training data (images of items including sensitive information and items that do not include sensitive information) were identified in a feature extraction process and then mapped to the intended labels ("sensitive information" and "non-sensitive information") so that the model could recognize similar sets of features in the testing data and classify it accordingly.

The MobileNet neural network was trained four times, with a different amount of training data in each trial. In the first trial, the neural network was trained on 190 images labeled "Sensitive Information" and 190 labeled "Non-sensitive Information"; in the second trial, on 290 images of each label; in the third trial, on 390 images of each label; and in the fourth trial, on 490 images of each label. The neural network did not show notable progress after the fourth trial, so no further training sessions were run.

The data labeled "Sensitive Information" included images of credit cards, debit cards, bank statements, income statements, transcripts, and login pages with credentials, while the data labeled "Non-sensitive Information" included images of everyday items, in addition to images that the model could confuse with sensitive data (images including tables and numbers): calendars, schools' public schedules, and meal plans of boarding schools and hotels. Both types, sensitive and non-sensitive, included images in different positions, lightings, and backgrounds.

A web application interface was built using JavaScript and HTML, and the model was integrated with it to simulate a real-life experience, especially since OCR applications that store text in cloud storage are web-based.
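For readers who want to reproduce a setup of this kind, the following is an illustrative sketch of training a two-class MobileNet classifier on folders of labeled images. The framework (TensorFlow/Keras), directory names, image size, and hyperparameters are assumptions; the paper does not specify its exact training configuration, and its interface was built separately in JavaScript and HTML.

```python
# Illustrative sketch of training a two-class MobileNet classifier on folders of
# labeled images. Directory names, image size, and hyperparameters are
# assumptions; the study's exact training configuration is not specified.
import tensorflow as tf

IMG_SIZE = (224, 224)

# Expected layout: data/train/sensitive_information/*.jpg
#                  data/train/non_sensitive_information/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=IMG_SIZE, batch_size=32)

# MobileNet pretrained on ImageNet used as a feature extractor, with a small
# classification head for the two labels used in this study.
base = tf.keras.applications.MobileNet(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet", pooling="avg")
base.trainable = False

inputs = tf.keras.Input(shape=IMG_SIZE + (3,))
x = tf.keras.applications.mobilenet.preprocess_input(inputs)
x = base(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # sensitive vs. non-sensitive
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=10)  # repeated with 190/290/390/490 images per class
```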


Testing the neural network’s accuracy: 

After each training session, the model was tested on 50 images including sensitive information and 50 including non-sensitive information, and the number of images the model classified incorrectly in each trial was documented. Figures (5), (6), (7), and (8) show the model displaying its highest probability for some of the testing data.


The time in which the model classified the testing data was recorded with a stopwatch. The accuracy of identifying sensitive data and non-sensitive data (whether printed, in real life, or on a phone screen) in each trial was calculated by dividing the number of images the model classified correctly by 50 and then multiplying by one hundred, as shown in the following equation: [(number of images classified correctly / 50) × 100].
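For example, applying this formula to the fourth-trial results reported in the Discussion: the accuracy on sensitive information is (48/50) × 100 = 96%, the accuracy on non-sensitive information is (50/50) × 100 = 100%, and the combined accuracy over all 100 testing images is (98/100) × 100 = 98%.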


4. Results:

Throughout the testing process, the number of training images for both classes was varied four times. As demonstrated in the following table, the recognition speed and accuracy of the model varied with the amount of training data, up to a certain extent. In each trial, the model was tested on 50 images containing sensitive data and 50 containing non-sensitive data. It is worth noting that the change in the time spent on recognition as the amount of training data changed was relatively minuscule.


5. Discussion:

The results show that the accuracy of the model generally increased as the number of training images increased, which is true of most prediction models [24], since increasing the training data lessens the likelihood of overfitting [25].


In the final trial, the model classified all 50 items including non-sensitive information as "Non-Sensitive Information" and none as "Sensitive Information", classifying each item in approximately 1 second. Out of the 50 images including sensitive information, the model classified 48 as "Sensitive Information" and 2 as "Non-Sensitive Information", classifying each image in approximately 2 seconds. These results indicate that identifying sensitive information with high efficiency is feasible. Below are some reasons that could explain these observations.


First, the training data sets consisted of sensitive and non-sensitive data in various  positions, lightings, backgrounds, and situations, which may have allowed the model to  accurately classify most of the testing data. 

 

Second, the model did not classify uploaded images; instead, it classified the testing data live, displaying a final prediction only once the testing item was stable. If the model were to classify captured images, the user might upload shaky or low-quality images, altering the dimensions of the images' components and, accordingly, the accuracy.


It is worth noting that MobileNets generally provide higher accuracy with small training data than standard models. 


6. Limitations: 

The language in every text-containing image, whether in the training or the testing data, is English; the same performance is not guaranteed for other languages.


The model was only trained on American and European formats of bank statements and  income statements, which will likely result in a weaker performance with other formats. 


Non-sensitive information data sets included images with cells and numbers, such as calendars, which the model could confuse with sensitive information (see Section 3); nevertheless, there could be other data that confuses the model and was not considered.


The time the model spent classifying testing data could be recorded with a more advanced method to further reduce measurement errors.


If applied in real life, this model could only identify sensitive information coming through OCR applications and help prevent it from being processed and stored on cloud storage; yet a user can still manually store an image of a debit card on the cloud. That is, the model will never be able to prevent all violations associated with cloud storage.


7. Conclusion: 

In this paper, based on the available information and previous attainments, I proposed a solution of classifying items into sensitive and non-sensitive information before scanning them with OCR applications that store extracted text on cloud storage, for robust protection against the privacy violations associated with the cloud. I also tested the main part of this proposed solution, which is differentiating between sensitive and non-sensitive information at a reliable level of accuracy, and kept records of the outcomes of all trials until a high level of efficiency was reached. Sensitive information in the testing data was identified in approximately 2 seconds per item with an accuracy that reached 98% in the fourth trial, demonstrating the validity of the major part of the proposed solution. This indicates that, if integrated with the MobileNet trained in this study through simple coding, including if-else statements [26], OCR applications should be able to identify sensitive information and prevent it from being processed and stored on the cloud, and accordingly offer a reliable level of security to their users.


8. Recommendations: 

For future work, I suggest implementing a user-friendly format for this solution: whenever the user tries to scan an item that includes sensitive information, the application prints a message stating that the user is trying to scan sensitive information and should try another item; whenever the user scans an item that does not include sensitive information, the application swiftly proceeds to processing and extracting text from the item (a sketch of this gating logic is shown below). It is also recommended to address the limitations mentioned above with strong approaches: include training data in different languages, train the model on different formats of bank statements and income statements, include more non-sensitive information that the model could conflate with sensitive information, and record the time spent classifying each testing item with a more advanced method.
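The gate described above amounts to a simple if-else in front of the OCR pipeline. The sketch below is a hedged illustration in Python; the study's actual interface was a JavaScript/HTML web page, and the helper names and threshold here are hypothetical placeholders rather than the paper's implementation.

```python
# Hedged sketch of the proposed gate in front of an OCR pipeline. The threshold
# and the helper callables (classify_sensitivity, run_ocr) are hypothetical;
# the study's actual interface was built with JavaScript and HTML.
from typing import Callable

SENSITIVE_THRESHOLD = 0.5  # hypothetical cutoff on the model's probability output


def scan(image,
         classify_sensitivity: Callable[[object], float],
         run_ocr: Callable[[object], str]) -> str:
    """Gate an OCR call behind the sensitivity classifier, as proposed in this paper.

    classify_sensitivity: returns the trained MobileNet's probability that the
        image contains sensitive information (hypothetical helper).
    run_ocr: the OCR application's normal text-extraction path (hypothetical helper).
    """
    if classify_sensitivity(image) >= SENSITIVE_THRESHOLD:
        # Sensitive item: never extract or store its text; inform the user instead.
        return ("This item appears to contain sensitive information. "
                "Please try scanning another item.")
    # Non-sensitive item: proceed to text extraction as usual.
    return run_ocr(image)
```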


9. References  

  • [1] N. A. M. Isheawy and H. Hasan, “Optical Character Recognition (OCR) System,” Opt.  Character Recognit., vol. 17, no. 2, Apr. 2015. 

  • [2] I. Barberá, “AI Possible Risks & Mitigations - Optical Character Recognition,” Opt.  Character Recognit., Sep. 2023. 

  • [3] N. Vurukonda and B. T. Rao, “A Study on Data Storage Security Issues in Cloud Computing,” Procedia Comput. Sci., vol. 92, pp. 128–135, 2016, doi: 10.1016/j.procs.2016.07.335.

  • [4] N. C. Rajasekar and C. O. Imafidon, “Exploitation of Vulnerabilities in Cloud-Storage,” GSTF Int. J. Comput., vol. 1, no. 2, 2011, doi: 10.5176/2010-2283_1.2.41.

  • [5] O. Awodele, E. E. Onuiri, and S. O. Okolie, “Vulnerabilities in Network Infrastructures and Prevention/Containment Measures.”

  • [6] H. De Bruijn and M. Janssen, “Building Cybersecurity Awareness: The need for evidence based framing strategies,” Gov. Inf. Q., vol. 34, no. 1, pp. 1–7, Jan. 2017, doi:  10.1016/j.giq.2017.02.007. 

  • [7] A. D. Dongare, R. R. Kharde, and A. D. Kachare, “Introduction to Artificial Neural  Network,” vol. 2, no. 1, 2012. 

  • [8] A. Singh, K. Bacchuwar, and A. Bhasin, “A Survey of OCR Applications,” Int. J. Mach.  Learn. Comput., pp. 314–318, 2012, doi: 10.7763/IJMLC.2012.V2.137. 

  • [9] P. Shital and C. R., “Web Browser Security: Different Attacks Detection and Prevention  Techniques,” Int. J. Comput. Appl., vol. 170, no. 9, pp. 35–41, Jul. 2017, doi:  10.5120/ijca2017914938. 

  • [10] A. K. Jain, S. R. Sahoo, and J. Kaubiyal, “Online social networks security and privacy:  comprehensive review and analysis,” Complex Intell. Syst., vol. 7, no. 5, pp. 2157–2177,  Oct. 2021, doi: 10.1007/s40747-021-00409-7. 

  • [11] W. Rawat and Z. Wang, “Deep Convolutional Neural Networks for Image Classification: A  Comprehensive Review,” Neural Comput., vol. 29, no. 9, pp. 2352–2449, Sep. 2017, doi:  10.1162/neco_a_00990. 

  • [12] M. Islam, G. Chen, and S. Jin, “An Overview of Neural Network,” Am. J. Neural Netw.  Appl., vol. 5, no. 1, p. 7, 2019, doi: 10.11648/j.ajnna.20190501.12. 

  • [13] M. M. Srivastava and P. Kumar, “Machine Learning approaches to do size-based reasoning on Retail Shelf objects to classify product variants,” Oct. 07, 2021, arXiv: arXiv:2110.03783. Accessed: Aug. 15, 2024. [Online]. Available: http://arxiv.org/abs/2110.03783

  • [14] D. Huang, Q. Bu, Y. Qing, Y. Fu, and H. Cui, “Feature Map Testing for Deep Neural Networks,” Jul. 21, 2023, arXiv: arXiv:2307.11563. Accessed: Aug. 15, 2024. [Online]. Available: http://arxiv.org/abs/2307.11563

  • [15] A. G. Howard et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” 2017, arXiv. doi: 10.48550/ARXIV.1704.04861.

  • [16] B. Khasoggi, E. Ermatita, and S. Samsuryadi, “Efficient mobilenet architecture as image  recognition on mobile and embedded devices,” Indones. J. Electr. Eng. Comput. Sci., vol.  16, no. 1, p. 389, Oct. 2019, doi: 10.11591/ijeecs.v16.i1.pp389-394. 

  • [17] V. Nasteski, “An overview of the supervised machine learning methods,” HORIZONS.B,  vol. 4, pp. 51–62, Dec. 2017, doi: 10.20544/HORIZONS.B.04.1.17.P05. 

  • [18] F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” 2016,  arXiv. doi: 10.48550/ARXIV.1610.02357. 

  • [19] Y. Guo, Y. Li, L. Wang, and T. Rosing, “Depthwise Convolution Is All You Need for  Learning Multiple Visual Domains,” Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, pp.  8368–8375, Jul. 2019, doi: 10.1609/aaai.v33i01.33018368. 

  • [20] B.-S. Hua, M.-K. Tran, and S.-K. Yeung, “Pointwise Convolutional Neural Networks,”  2017, arXiv. doi: 10.48550/ARXIV.1712.05245. 

  • [21] K. N. Plataniotis, D. Androutsos, and A. N. Venetsanopoulos, “Multichannel filters for  image processing,” Signal Process. Image Commun., vol. 9, no. 2, pp. 143–158, Jan. 1997,  doi: 10.1016/S0923-5965(96)00021-5. 

  • [22] T. Van Dijk and G. De Croon, “How Do Neural Networks See Depth in Single Images?,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South): IEEE, Oct. 2019, pp. 2183–2191, doi: 10.1109/ICCV.2019.00227.

  • [23] C. Campbell, “Chapter 7: An Introduction to Kernel Methods.”

  • [24] A. R. Ajiboye, R. Abdullah-Arshah, H. Qin, and H. Isah-Kebbe, “Evaluating the Effect of Dataset Size on Predictive Model Using Supervised Learning Technique,” Int. J. Comput. Syst. Softw. Eng., vol. 1, no. 1, pp. 75–84, Feb. 2015, doi: 10.15282/ijsecs.1.2015.6.0006.

  • [25] X. Ying, “An Overview of Overfitting and its Solutions,” J. Phys. Conf. Ser., vol. 1168, p.  022022, Feb. 2019, doi: 10.1088/1742-6596/1168/2/022022. 

  • [26] S. Nurollahian, M. Hooper, A. Salazar, and E. Wiese, “Use of an Anti-Pattern in CS2:  Sequential if Statements with Exclusive Conditions,” in Proceedings of the 54th ACM  Technical Symposium on Computer Science Education V. 1, Toronto ON Canada: ACM,  Mar. 2023, pp. 542–548. doi: 10.1145/3545945.3569744.
