- Posted by redglue
- On August 28, 2017
- gdpr, machine learning, redatasense
We have been working on a project that we call Redatasense. It grew out of our team's work on the original open source project DataDefender. After a few patches to the original project to fix “the basics”, we decided to fork it and take the code in another direction, one focused on finding sensitive data in production environments.
Redatasense is built on Java 8, Apache Tika, and OpenNLP, a natural language processing API that uses a set of machine learning models to search for data.
Version 1.0 is out. It supports OpenNLP Maximum Entropy modeling, OpenNLP dictionary tokenize-and-search, and OpenNLP regex search on every data source you may have.
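To give a rough idea of what dictionary search and regex search mean in practice, here is a minimal sketch in plain Java 8. It does not use the OpenNLP API and is not taken from the Redatasense code base; the class name `TextSearchSketch`, the tiny sample dictionary, and the e-mail pattern are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not Redatasense code and not the OpenNLP API.
class TextSearchSketch {

    // Assumed sample dictionary of first names; a real one would be loaded from a file.
    static final Set<String> NAMES = new HashSet<>(Arrays.asList("maria", "joao", "ana"));

    // Assumed, deliberately simple e-mail pattern.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Dictionary search: lower-case the text, split it into tokens on anything
    // that is not a letter or digit, and keep the tokens found in the dictionary.
    static List<String> dictionaryHits(String text) {
        List<String> hits = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (NAMES.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    // Regex search: collect every substring that matches the pattern.
    static List<String> regexHits(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Contact Maria at maria@example.com or Ana by phone.";
        System.out.println(dictionaryHits(text));
        System.out.println(regexHits(text));
    }
}
```

A model-based search such as Maximum Entropy differs from both of these in that it scores tokens from learned features rather than from a fixed list or pattern, which is why the three approaches complement each other.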
Redatasense features Database Column Discovery, Database Data Discovery and File Discovery.
For File Discovery, it reads the contents of your documents (Word, Excel, PowerPoint, CSV, text, PDF, etc.) and applies the models to search for data.
For databases, it reads column names and also samples data from your tables to identify possibly sensitive data.
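The two database checks can be pictured roughly as follows. This is a simplified sketch, not the Redatasense implementation: the class name `ColumnDiscoverySketch`, the list of suspicious name fragments, and the e-mail pattern are assumptions, and the sample values are passed in as a plain list, whereas a real tool would fetch them over JDBC.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: not Redatasense code.
class ColumnDiscoverySketch {

    // Assumed fragments of column names that often indicate personal data.
    static final Pattern SUSPECT_NAME =
            Pattern.compile(".*(name|email|phone|birth|address).*", Pattern.CASE_INSENSITIVE);

    // Assumed, deliberately simple e-mail pattern applied to sampled values.
    static final Pattern EMAIL_VALUE =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Column discovery: flag a column purely from its name.
    static boolean nameLooksSensitive(String columnName) {
        return SUSPECT_NAME.matcher(columnName).matches();
    }

    // Data discovery: fraction of sampled values that are e-mail addresses
    // (0.0 for an empty sample; null values count as non-matches).
    static double emailRatio(List<String> sample) {
        if (sample.isEmpty()) {
            return 0.0;
        }
        long hits = sample.stream()
                .filter(v -> v != null && EMAIL_VALUE.matcher(v).matches())
                .count();
        return (double) hits / sample.size();
    }

    public static void main(String[] args) {
        System.out.println(nameLooksSensitive("CUSTOMER_EMAIL"));
        System.out.println(emailRatio(Arrays.asList("a@b.org", "n/a")));
    }
}
```

Checking sampled values as well as names matters because columns are often named unhelpfully (for example `field_7`), while a name-only match can flag columns that hold no personal data at all.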
The only things that are not open source are the OpenNLP models for the Portuguese language and the dictionaries.
We have been using it in production environments for GDPR projects to find sensitive personal data.
PS: A full disclaimer explaining why we forked it is available at: https://github.com/redglue/redsense#Disclamer