Payloads And OCR With Solr & Lucene At Scale





1 October 2019



Our first talk will be from Eric Pugh and Dan Worley of OpenSource Connections on "Payloads + OCR: Easily search your PDFs with Solr".
Payloads have been a powerful aspect of Lucene for a long time, but have only had limited exposure in Solr. The Tika project has only recently finished integrating the powerful Tesseract OCR library, bringing the prospect of OCR to the masses. Tonight you’ll learn how to pair both of these capabilities. This talk will be in two parts:
- How we used the Tika framework plus Tesseract to easily OCR various PDF documents.
- A deep dive into the custom Payload component that we built to expose the power of Lucene’s payloads.

Eric Pugh is the co-founder of OpenSource Connections, co-author of the first-ever book on Solr, and an emeritus member of the Apache Software Foundation. Dan Worley, a Search Relevance Engineer at OpenSource Connections, has been working in search since 2008, either directly leveraging Lucene or customizing Solr & Elasticsearch to meet the growing needs of customers.

Our second talk, from Florian Buetow of Mimecast, will be on “Lucene at Scale – Indexing Billions of Loglines each Day”. Indexing application logfiles at scale can be a challenge, especially when the velocity of the data is high and the volume is ever-increasing. This talk focusses on these and other challenges that we encountered and overcame while scaling the distributed logfile search and indexing platform at Mimecast.
We will look at the architecture, from its early beginnings with Hadoop and Elasticsearch through its transition to a custom system implemented with Lucene and Java, and share the lessons learned. The platform currently handles billions of loglines and terabytes of data every day around the world.

Florian Buetow is a senior software engineer for search at Mimecast. With 15+ years of experience as a developer and project lead and 4+ years of experience working with search technologies, he now specializes in distributed systems that can process petabytes of data and billions of documents per day using Lucene and Java. He is a co-organizer of the Machine Learning Meetups at Mimecast and enjoys photography in his free time.