Intelligent Information Extraction from Unstructured Data with Verisk Analytics
Written by Verisk Analytics
Aditi Nayak, Software Engineer at Verisk Analytics
Verisk Analytics is a technology- and data-focused organization. Their story is one of growth and innovation, continually introducing new solutions for customers. Fifty years ago, Verisk was a nonprofit US insurance rating and advisory organization. It has since evolved into an agile global company, making acquisitions, acquiring key data sets, adding key capabilities, and expanding into new verticals. At Verisk, people drive the innovation behind a great customer experience and are critical to delivering consistently solid results. Verisk has a diverse and passionate global team of more than 9,000.
Verisk is passionate about data but also committed to corporate social responsibility, including inclusion, diversity, and belonging initiatives; being a partner of Women Who Code is one of them. Verisk is headquartered in Jersey City and has offices across the US. Visit their website to see open positions.
Aditi Nayak, Software Engineer at Verisk, talks about intelligent information extraction from unstructured data and shares some technology tips you can apply in your day-to-day work.
Aditi: We are a leading data analytics provider in multiple sectors. One of the analytical products we have been working on is intelligent information extraction from unstructured data, also known as Machine. Machine is a platform being built at Verisk that allows users to create new customized databases from various inputs, such as news, social media, and other structured and unstructured data. It also lets users interact with this information through automated alerts and dashboards. Machine supports targeted searches, keyword detection, and highlighting.
Machine can be integrated with various downstream applications to provide predictive insights and analytics in real time, and it can easily be plugged into other applications to enhance them. It accepts many different kinds of data, be it structured data, unstructured data, or unstructured media. Once this data is ingested, Machine runs different kinds of machine learning models on it.
With Machine, what we are primarily looking to do is extract context. Context is critical to the extractions. Market competitors do have good off-the-shelf models, but they do not provide context. For example, if person A talks to person B about an insurance policy, it is important to know which policy person A talked about, and in what tone, to person B. Machine helps users extract that context. There are other types of information we try to extract as well. For example, when a building collapsed in Miami, it was important for insurers to know why it occurred. What kind of event was it? Was it a natural calamity, or was it due to human error? How many deaths were involved? How many injuries were reported? This kind of information helps insurers assess the risk of such buildings.
The other thing that sets us apart from market competitors is that Verisk is a leading analytics provider across multiple domains. This domain knowledge helps us build models that extract information that is useful to our clients, and that domain knowledge, coupled with advanced analytics, is what lets Machine enhance your downstream applications.
Machine helps analysts glance through data and record their observations, which in turn helps analysts at different insurance companies assess the risk of different entities in similar regions. Machine accepts different kinds of data: structured, unstructured, and unstructured media. Primarily, it focuses on unstructured data from news articles, social media feeds, and PDF and Excel files provided by clients. It also ingests structured data such as government and other proprietary data sets. The entire Machine application is built on the AWS cloud with a special focus on the security and integrity of client data.
Our ETL pipelines are built using AWS's serverless technology, Lambda. Some of our pipelines are built as ECS scheduled tasks, which scale up when there is load on the system and scale down when there is none. This helps us provide higher performance with cost savings. The main messaging mechanism between the processing and storage modules is AWS's queuing service, Amazon SQS. We have also set up alerts in the system using AWS's notification service, Amazon SNS. Once the data is processed, it moves into Elasticsearch. Elasticsearch is the main component that powers Machine: it makes queries efficient and provides real-time search results.
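To make the Lambda-to-SQS-to-Elasticsearch flow concrete, here is a minimal sketch of what a storage-side Lambda handler could look like. The endpoint, index, and field names are assumptions for illustration only, not Machine's actual configuration.

```python
import json
import os

from elasticsearch import Elasticsearch, helpers

# Hypothetical endpoint and index names, configured via environment variables.
ES_HOST = os.environ.get("ES_HOST", "https://search-machine-demo.example.com:443")
ES_INDEX = os.environ.get("ES_INDEX", "machine-articles")

es = Elasticsearch([ES_HOST])


def handler(event, context):
    """Lambda triggered by the SQS queue between the processing and storage
    modules: bulk-index each message body (assumed to be a JSON document)
    into Elasticsearch so it is available for real-time search."""
    actions = []
    for record in event.get("Records", []):
        doc = json.loads(record["body"])
        actions.append({
            "_index": ES_INDEX,
            "_id": doc.get("article_id"),  # assumed document id field
            "_source": doc,
        })

    if actions:
        helpers.bulk(es, actions)

    return {"indexed": len(actions)}
```

Using a bulk indexing call keeps the per-message overhead low when SQS delivers records in batches, which fits the scale-up/scale-down behavior described above.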
In the Machine environment, there are different data sources users can query, be it news, social media, or files. We expect the user to provide a unique name for their scenario so they can identify its search results on the scenario dashboard. By default, Machine searches across all countries, all publishers, and all domains, but the user can restrict the search to particular countries, publishers, and domains. Furthermore, you can assign a publisher priority to each publisher; the values a user can set are high, medium, and low, and the results on the scenario result screen are sorted based on these values. We also allow the user to save these selections as a set, which is auto-populated the next time they create a different scenario.
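The shape of such a scenario might look roughly like the sketch below. All field names are illustrative assumptions; Machine's actual schema is not public.

```python
# Illustrative scenario definition: a unique name, a source, optional
# country/publisher/domain restrictions, and per-publisher priorities.
scenario = {
    "name": "florida-building-collapse",       # unique scenario name
    "source": "news",                          # news | social_media | files
    "countries": ["US"],                       # default would be all countries
    "publishers": {
        "Publisher A": "high",                 # priority: high | medium | low
        "Publisher B": "low",
    },
    "domains": ["construction", "insurance"],  # default would be all domains
}

# Selections can be saved as a reusable "set" so they are auto-populated
# the next time the user creates a scenario.
saved_sets = {}


def save_selection_set(name, scen):
    saved_sets[name] = {k: scen[k] for k in ("countries", "publishers", "domains")}


save_selection_set("us-construction", scenario)
```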
Users can remove duplicates from the search results, which is important if you are trying to identify unique events. A user can also create a scenario from social media feeds. Social media is important to us because it gives real-time information about events happening around the world. In Machine, you can select social media as a scenario source and, like news scenarios, apply tags to it. You can apply a saved query or build queries on the fly in the query builder box, and you can also set the time frame in which you are looking for a particular event in this data set.
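Since the results ultimately come out of Elasticsearch, one way such a social-media scenario could translate into a query is sketched below. The index and field names (text, posted_at, content_hash) are assumptions, and collapsing on a content hash is only one possible way to approximate the duplicate removal described above.

```python
# Sketch of an Elasticsearch query a social-media scenario might produce.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"text": "building collapse"}},  # query-builder keywords
            ],
            "filter": [
                {"range": {"posted_at": {"gte": "2021-06-01",   # user-selected
                                         "lte": "2021-07-01"}}} # time frame
            ],
        }
    },
    # Collapsing on a content hash drops exact duplicates so that only
    # unique events show up in the scenario results.
    "collapse": {"field": "content_hash"},
    "sort": [{"posted_at": {"order": "desc"}}],
}

# results = es.search(index="posts", body=query)
```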
We expect users to create what are called file configs, which reduce the effort of re-uploading files for different kinds of analysis. Once a file is uploaded and a file config is created, the same file config can be used to create multiple scenarios: you can use the same files to fire multiple queries or extract different kinds of information. In Machine, we also allow users to create their own folders and upload their files into them.
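Conceptually, a file config is just a named, reusable grouping of uploaded files that several scenarios can point at. The sketch below uses made-up field names purely to illustrate that reuse.

```python
# Illustrative only: a "file config" records which uploaded files belong
# together so they can be reused across scenarios without re-uploading.
file_config = {
    "config_id": "claims-q2-2021",
    "folder": "acme-insurance/claims",            # user-created folder
    "files": ["claims_may.pdf", "claims_june.xlsx"],
}

# The same config can back several scenarios, each firing a different query.
scenarios = [
    {"name": "water-damage-claims", "source": "files",
     "file_config": file_config["config_id"], "query": "water damage"},
    {"name": "roof-claims", "source": "files",
     "file_config": file_config["config_id"], "query": "roof OR hail"},
]
```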
Machine’s primary use is to extract information from the data we have ingested. Our different models are deployed and run on the data coming out of the ingestion pipeline. Once the data is ingested, a message is sent to an SQS queue. This message triggers an orchestrator Lambda, which, based on the message type, writes it to the appropriate model queue. These queues in turn trigger the different NLP models, which run as ECS services with the actual models deployed in them.
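A minimal sketch of that fan-out pattern is shown below. The message types and queue URLs are hypothetical; only the routing pattern (orchestrator Lambda reading SQS records and forwarding to per-model queues) reflects the description above.

```python
import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical mapping from message type to the queue feeding each NLP
# model pipeline; the real queue names and types are not public.
QUEUE_URLS = {
    "news": "https://sqs.us-east-1.amazonaws.com/123456789012/machine-nlp-news",
    "social": "https://sqs.us-east-1.amazonaws.com/123456789012/machine-nlp-social",
    "file": "https://sqs.us-east-1.amazonaws.com/123456789012/machine-nlp-files",
}


def handler(event, context):
    """Orchestrator Lambda: read each ingestion message and forward it to
    the queue for the matching NLP model pipeline, which in turn triggers
    the ECS services where the actual models are deployed."""
    routed = 0
    for record in event.get("Records", []):
        message = json.loads(record["body"])
        queue_url = QUEUE_URLS.get(message.get("type"))
        if queue_url is None:
            continue  # unknown message type; in practice this would be logged
        sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(message))
        routed += 1
    return {"routed": routed}
```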
Tags help users categorize their search results into different buckets. A feature called tag builder allows you to build your own customizable tags based on user queries, and you can give different keywords to each tag. There are different kinds of match types: exact, exact case, and fuzzy. Exact searches for the exact word in the text. Exact case looks for the exact word with matching case. Fuzzy finds all the variations of a word, for example produce, production, producing, and so on. Since this whole process can be cumbersome when you have to apply multiple tags to a scenario, you can also bulk-upload tags to Machine.
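To make the three match types concrete, here is a plain-Python approximation of their semantics. The real matching almost certainly happens inside Elasticsearch; this is only a sketch, and the crude prefix-based "fuzzy" rule is an assumption standing in for proper stemming.

```python
import re


def tag_matches(text, keyword, match_type):
    """Approximate the tag-builder match types: exact, exact case, fuzzy."""
    if match_type == "exact_case":
        # Word must appear with the same casing.
        return re.search(rf"\b{re.escape(keyword)}\b", text) is not None
    if match_type == "exact":
        # Word must appear; casing is ignored.
        return re.search(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE) is not None
    if match_type == "fuzzy":
        # Crude stem match: "prod..." catches produce, production, producing.
        stem = keyword[: max(4, len(keyword) - 3)]
        return re.search(rf"\b{re.escape(stem)}\w*", text, re.IGNORECASE) is not None
    raise ValueError(f"unknown match type: {match_type}")


print(tag_matches("New production line announced", "produce", "fuzzy"))      # True
print(tag_matches("New production line announced", "produce", "exact"))      # False
print(tag_matches("They Produce widgets", "produce", "exact_case"))          # False
```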
Export is important because it describes how we deliver data to our clients and how we send data out to be enhanced by our downstream applications. An export config has attributes such as what you want to export from the Machine environment, the frequency of export, and the file type. Once the export config is created, it is written to a DynamoDB table. At regular intervals, a CloudWatch rule fires, which in turn triggers an orchestrator Lambda.
This Lambda looks up, in the DynamoDB table, all the export configs for which delivery is still pending. Based on the config type, each is routed to a different queue, and each of these queues triggers an ECS task. The ECS task reads the export config from DynamoDB and, based on it, reads the articles from Elasticsearch. Once the data and the extractions are aggregated, the ECS task writes the data in the format the user selected. Currently, Machine supports two output formats, JSON and CSV files, and the maximum number of articles you can export in one export file is 10,000.
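Here is a rough sketch of what the ECS task's side of that flow could look like, assuming hypothetical table, index, and attribute names (machine-export-configs, machine-articles, scenario, file_type, fields). It only illustrates the steps described above: read the config from DynamoDB, pull articles from Elasticsearch, and write JSON or CSV capped at 10,000 articles.

```python
import csv
import json

import boto3

MAX_ARTICLES_PER_FILE = 10_000  # export limit mentioned above

dynamodb = boto3.resource("dynamodb")
config_table = dynamodb.Table("machine-export-configs")  # assumed table name


def run_export(config_id, search_client):
    """Read one export config from DynamoDB, fetch the matching articles
    from Elasticsearch, and write them out in the user's chosen format."""
    config = config_table.get_item(Key={"config_id": config_id})["Item"]

    response = search_client.search(
        index="machine-articles",  # assumed index name
        body={
            "query": {"match": {"scenario": config["scenario"]}},
            "size": MAX_ARTICLES_PER_FILE,
        },
    )
    articles = [hit["_source"] for hit in response["hits"]["hits"]]

    if config["file_type"] == "json":
        with open(f"{config_id}.json", "w") as fh:
            json.dump(articles, fh)
    else:  # csv
        with open(f"{config_id}.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=config["fields"])
            writer.writeheader()
            for article in articles:
                writer.writerow({k: article.get(k, "") for k in config["fields"]})
```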
We have discussed the three main modules that power Machine: data ingestion, model extraction, and export.