With the size and scope of collected data expanding with each passing day, it is very important to continually analyze available data to drive better-informed business and policy decisions. To a large extent, handwritten data remains unexplored and unanalyzed. If we can analyze handwritten text data, we can minimize the hurdles and save the manpower involved in digitizing handwritten data.
In this blog post, we cover how handwritten text data can be processed using Python and Google Cloud Vision. Cloud Vision offers powerful pre-trained ML models, so we do not need to train any models ourselves.
Below is an example of converting simple handwritten text into digital words, which can be easily ingested into a CSV file or database.
[Image: handwritten number] should be converted to 45789
[Image: handwritten word] should be converted to casino
Using the applications listed in the instructions below, we can convert scanned images or PDF files into digital text. Our approach first converts PDF documents to an image format. Because we are processing images, it is better to convert them all to the same size. We then need to define the region of each image from which data will be extracted, which requires knowing the coordinates of that data.
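from pdf2image import convert_from_path
import cv2
# Convert the PDF to images; every page is saved under the same name, so this assumes a single-page PDF
images = convert_from_path(file)
for img in images:
    img.save(path + "\\" + fileName + '.jpg', 'JPEG')
# Load the saved image for OCR
image = cv2.imread(path + "\\" + fileName + '.jpg')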
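One reason a common image size helps is that a crop box measured once on a reference scan can be reused on every other scan. Where resizing is not practical, a similar effect can be had by scaling the box instead. The following is a minimal sketch of that arithmetic; the function name `scale_box` and all the numbers are illustrative assumptions, not part of the original workflow.

```python
def scale_box(box, source_size, target_size):
    """Map a (left, upper, right, lower) box from one image size to another."""
    src_w, src_h = source_size
    dst_w, dst_h = target_size
    left, upper, right, lower = box
    x_ratio = dst_w / src_w
    y_ratio = dst_h / src_h
    return (
        round(left * x_ratio),
        round(upper * y_ratio),
        round(right * x_ratio),
        round(lower * y_ratio),
    )


# Hypothetical numbers: a box measured on a 1000 x 2000 pixel template,
# reused on a scan that is twice as large in both directions.
template_box = (100, 200, 400, 260)
scaled = scale_box(template_box, (1000, 2000), (2000, 4000))
print(scaled)
```

If every scan is resized to the template size first, the ratios are both 1 and the box can be used unchanged.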
If the document contains printed (machine-typed) text, we can use Tesseract, through pytesseract, to extract the fields directly. As we are dealing with handwritten text, Tesseract will not work for us to extract the data itself, but we can use it to locate the printed field labels and get their coordinates. Let us get the coordinates for the fields using Python commands instead of a UI application.
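import pytesseract
from pytesseract import Output
# Run Tesseract and return word-level results, including bounding boxes, as a dictionary of lists
data = pytesseract.image_to_data(image, output_type=Output.DICT)
keys = list(data.keys())
print(keys)
print(data['text'])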
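import re
# accountIdReg is the regular expression for the printed field label (defined elsewhere)
for i in range(len(data['text'])):
    if re.match(accountIdReg, data['text'][i].lower()):
        (x, y, w, h) = (
            data['left'][i], data['top'][i], data['width'][i], data['height'][i])
        # Crop box is (left, upper, right, lower). Assumption: the handwritten value
        # runs from just right of the label to the page edge; tighten the right bound for your form
        idImageBox = (
            x + w + 65, y - 20, img.width, y + h + 5)
        # (1662, 2342)
        idImage = img.crop(idImageBox)
        # idImage.show()
        idImage.save(cropped_dir + '\\id_Extracted_' + fileName + '.png')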
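The label lookup and crop-box arithmetic above can be sketched as two plain functions that work on the dictionary-of-lists shape returned by `image_to_data`. This is a minimal sketch under stated assumptions: the helper names `find_label` and `value_box`, the `gap` and `pad` defaults, the label pattern, and the sample OCR data are all made up for illustration. It also assumes the handwritten value sits to the right of its label and may extend to the right edge of the page.

```python
import re


def find_label(ocr, pattern):
    """Return the index of the first OCR word matching pattern, or None."""
    for i, word in enumerate(ocr["text"]):
        if re.match(pattern, word.strip().lower()):
            return i
    return None


def value_box(ocr, i, image_size, gap=10, pad=5):
    """Box (left, upper, right, lower) to the right of word i, clamped to the image."""
    img_w, img_h = image_size
    x, y = ocr["left"][i], ocr["top"][i]
    w, h = ocr["width"][i], ocr["height"][i]
    left = min(x + w + gap, img_w)
    upper = max(y - pad, 0)
    lower = min(y + h + pad, img_h)
    return (left, upper, img_w, lower)


# Made-up OCR output for a 600 x 200 pixel image.
sample = {
    "text": ["", "Account", "ID:", ""],
    "left": [0, 20, 120, 0],
    "top": [0, 50, 50, 0],
    "width": [600, 90, 40, 0],
    "height": [200, 30, 30, 0],
}
index = find_label(sample, r"id:?$")
box = value_box(sample, index, (600, 200))
print(index, box)
```

The resulting tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, and clamping keeps the box inside the image when a label sits close to an edge.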
At this point, we use Handprint, a Python package, to convert the handwritten images into digital data, which can later be saved to a CSV file or a database. Handprint also produces annotated images showing the recognized text. Install Handprint on your server, set up Google Cloud Vision, and register the credentials needed to access the API. Credentials are registered with a command of the following form:
handprint -a SERVICENAME CREDENTIALSFILE.json
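import glob
import os
# Register the Google credentials with Handprint (options start with / on Windows)
os.system('handprint /a google C:\\Users\\ocr-keras\\cred.json')
# Run Handprint on every cropped image; /e also writes the JSON output used later
images = glob.glob(cropped_dir + "\\" + "*.png")
for image in images:
    os.system('handprint /s google /e ' + image)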
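Building the command as a list of arguments, rather than one concatenated string, avoids problems when an image path contains spaces. The following is a minimal sketch that uses only the options already shown above (`s` and `e`, prefixed with `/` on Windows and `-` elsewhere); the helper names `handprint_command` and `run_handprint` are hypothetical.

```python
import subprocess
import sys


def handprint_command(image_path, service="google", windows=None):
    """Build the Handprint argument list for one image."""
    if windows is None:
        windows = sys.platform.startswith("win")
    flag = "/" if windows else "-"
    return ["handprint", flag + "s", service, flag + "e", image_path]


def run_handprint(image_path):
    """Run Handprint on one image and report whether it exited cleanly."""
    result = subprocess.run(handprint_command(image_path))
    return result.returncode == 0


# Example argument list for a path that contains a space.
print(handprint_command("C:\\scans\\id Extracted.png", windows=True))
```

Checking the return code makes it possible to collect the images that failed and retry them, instead of continuing silently.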
Let us review the annotated images with the extracted information.
The annotated images confirm that the extracted data is accurate. The process above also generates JSON files containing the extracted data.
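import glob
import json
import pandas as pd
# fields is the list of field-name prefixes used when saving the cropped images
dictFields = {}
for field in fields:
    jsonFiles = glob.glob(cropped_dir + "\\" + field + "*.json")
    for jsonFile in jsonFiles:
        with open(jsonFile) as f:
            temp = json.load(f)
        # Assumes the file stores a JSON-encoded string, hence the second decode
        distros_dict = json.loads(temp)
        # The first text annotation holds the full recognized text
        print(distros_dict['text_annotations'][0]['description'])
        dictFields[field] = distros_dict['text_annotations'][0]['description']
dataFrame = pd.DataFrame([dictFields])
print(dataFrame)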
Below are the dataframe contents.
      Id business
0  45789   casino
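Reading the first annotation directly fails with an error when a crop produced no text at all. The following is a minimal sketch of a more defensive lookup, assuming the same `text_annotations` / `description` layout used above. The helper name `first_description` and the sample payloads are made up for illustration, and the second decode is applied only when the first one yields a string.

```python
import json


def first_description(raw):
    """Return the first text annotation from a JSON payload, or None."""
    data = json.loads(raw)
    # If the payload was a JSON-encoded string, decode it a second time.
    if isinstance(data, str):
        data = json.loads(data)
    annotations = data.get("text_annotations") or []
    if not annotations:
        return None
    return annotations[0].get("description", "").strip()


# Made-up payloads: once encoded, twice encoded, and empty.
inner = {"text_annotations": [{"description": "45789\n"}]}
single = json.dumps(inner)
double = json.dumps(single)
empty = json.dumps({"text_annotations": []})
print(first_description(single), first_description(double), first_description(empty))
```

Returning `None` for an empty result lets the caller decide whether to leave the field blank or flag the image for manual review.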
Save this dataframe to CSV format using the following.
dataFrame.to_csv(path + '\\' + fileName + '.csv', index=False)
As this example deals with only two images, the instructions do not include many validations. For a real-world application, analysts should validate the extracted data against confidence levels and keep only the results that meet an agreed threshold.
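A validation step of this kind can be sketched as follows. Everything here is an illustrative assumption: the record layout of (field, text, confidence), the 0.8 threshold, the per-field patterns, and the confidence values themselves, which are not taken from a real run.

```python
import re

# Hypothetical format rules: Id must be digits, business must be letters and spaces.
FIELD_PATTERNS = {"Id": r"\d+", "business": r"[A-Za-z ]+"}


def validate(records, min_confidence=0.8):
    """Split records into accepted values and the names of rejected fields."""
    accepted, rejected = {}, []
    for field, text, confidence in records:
        pattern = FIELD_PATTERNS.get(field)
        looks_valid = pattern is None or re.fullmatch(pattern, text) is not None
        if confidence >= min_confidence and looks_valid:
            accepted[field] = text
        else:
            rejected.append(field)
    return accepted, rejected


# Made-up confidence values for the two example fields.
records = [("Id", "45789", 0.97), ("business", "casino", 0.62)]
accepted, rejected = validate(records)
print(accepted, rejected)
```

Rejected fields can then be routed to a person for review rather than written to the CSV file.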
Handwritten image and text processing is useful across many industries, including healthcare, insurance, and utilities. Use cases include invoice processing, processing of inspection forms, employee onboarding, and analysis of reviews and survey data. One timely application for utilities and other organizations that rely on distributed field services is the digitization of handwritten notes taken by field technicians when completing repairs or regularly scheduled maintenance. By digitizing these notes, and even automating the process, utilities are better placed to retain critical handwritten information about asset health, maintenance history, and more.
About the Author
Manikanth Koora is a Senior Software Developer for Advanced Analytics at HEXstream. He has many years of experience in various programming languages, big data, machine learning, blockchain, and real-time data analytics. He completed his Master of Science in Information Technology at Southern New Hampshire University and has completed many certifications for big data and cloud technologies. Manikanth enjoys learning new technologies and watching TV shows.