Web Scraper Technology

We have developed our own web scraper technology as part of our portal solutions. It has been running in real-time mode on our portals for a long time, and we can apply it to your tasks.
Because the technology is ready, we do not need to spend time writing it from scratch. We only need to write a plugin for each source site and for each output format.
This significantly shortens the project schedule and reduces costs.
The scraper can collect different kinds of data and store it in a customized target format (such as CSV or Excel).
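To illustrate the plugin idea, here is a minimal sketch in Python. It is not our production code: the names SourcePlugin, OutputPlugin, StaticSource, CsvOutput and run are hypothetical, and the source plugin returns canned records instead of requesting a real site.

```python
import csv
import io
from typing import Dict, Iterable, List, Protocol


class SourcePlugin(Protocol):
    """Hypothetical interface: one plugin per source site."""

    def fetch(self) -> Iterable[Dict[str, str]]: ...


class OutputPlugin(Protocol):
    """Hypothetical interface: one plugin per output format."""

    def write(self, records: Iterable[Dict[str, str]]) -> str: ...


class StaticSource:
    """Stand-in source that yields canned records (no HTTP requests)."""

    def __init__(self, records: List[Dict[str, str]]) -> None:
        self._records = records

    def fetch(self) -> Iterable[Dict[str, str]]:
        return iter(self._records)


class CsvOutput:
    """Writes records as CSV text with a fixed column order."""

    def __init__(self, columns: List[str]) -> None:
        self._columns = columns

    def write(self, records: Iterable[Dict[str, str]]) -> str:
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=self._columns, lineterminator="\n")
        writer.writeheader()
        for record in records:
            writer.writerow(record)
        return buffer.getvalue()


def run(sources: List[SourcePlugin], output: OutputPlugin) -> str:
    """Collect records from every source and pass them to one output plugin."""
    collected: List[Dict[str, str]] = []
    for source in sources:
        collected.extend(source.fetch())
    return output.write(collected)


result = run(
    [
        StaticSource([{"title": "A", "url": "http://example.com/a"}]),
        StaticSource([{"title": "B", "url": "http://example.com/b"}]),
    ],
    CsvOutput(["title", "url"]),
)
print(result)
```

With this split, supporting a new site or a new output format means adding one small class, while the collection loop stays the same.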
The technology is mature and highly customizable. It has its own database with the history of already scraped data, which makes it possible to reconcile the data, to check the scraping history and the version history of each spidered entity, to check whether similar data already exists, and so on.
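The reconciliation and version history idea can be sketched as follows. This is an illustration under assumptions, not a description of our actual database: HistoryStore is a hypothetical in-memory stand-in, and comparing SHA-256 fingerprints of the records is one possible way to detect a change.

```python
import hashlib
import json
from typing import Dict, List


class HistoryStore:
    """Hypothetical in-memory stand-in for the scraper's history database."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Dict[str, str]]] = {}

    @staticmethod
    def _fingerprint(record: Dict[str, str]) -> str:
        # Serialize with sorted keys so that field order does not matter.
        canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def reconcile(self, entity_id: str, record: Dict[str, str]) -> str:
        """Return 'new', 'changed' or 'unchanged'; store a version when needed."""
        versions = self._versions.setdefault(entity_id, [])
        if not versions:
            versions.append(dict(record))
            return "new"
        if self._fingerprint(versions[-1]) == self._fingerprint(record):
            return "unchanged"
        versions.append(dict(record))
        return "changed"

    def history(self, entity_id: str) -> List[Dict[str, str]]:
        """Return all stored versions of an entity, oldest first."""
        return list(self._versions.get(entity_id, []))


store = HistoryStore()
print(store.reconcile("news-1", {"title": "First title"}))
print(store.reconcile("news-1", {"title": "First title"}))
print(store.reconcile("news-1", {"title": "Edited title"}))
print(len(store.history("news-1")))
```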

Technology capabilities:


Automatic content classification

We have a technology for automatic content classification. You define a directory of terms (including multi-word terms), and input data is classified automatically; related terms can also be selected by tags.
For example, on our medical portal, news is crawled periodically. After crawling, the news is classified, and each term found in the text becomes a hyperlink to that term's page.
For example, here is a list of terms in the terms admin tool:


Here is a news page fetched by the scraper and classified: http://www.med-life.com/news/pochemu-krayne-neobhodimo-kushat-morskuyu-ryibu-02042014

You can see hyperlinks in the text. They are created automatically. The news item was also automatically classified as related to «Атеросклероз» (Russian for "atherosclerosis"), and a link to this term's page was created under the news text. If you follow this link, you will see a back-reference link to this news item on the term page.
We can also classify phrases, not only single words: there is one two-word link, «рыбий жир» ("fish oil").
Not all of the links appear under the news text. This depends on the term type: we have terms of type «disease» and «substance», and we define different processing depending on the type of the term.
So we can classify crawled content automatically and perform any action on the classified content.
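A simplified sketch of dictionary-based classification is shown below. It makes several assumptions: the TERMS directory is a hypothetical sample in English, matching is done with a plain regular expression (real Russian text would also need handling of word forms), and the rule "only terms of type disease are listed under the article" is our reading of the example above, used here to show type-dependent processing.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical term directory: term text -> term type.
TERMS: Dict[str, str] = {
    "atherosclerosis": "disease",
    "fish oil": "substance",
    "fish": "substance",
}


def build_pattern(terms: Dict[str, str]) -> "re.Pattern[str]":
    # Longest terms first, so that "fish oil" wins over "fish".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def classify(text: str, terms: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (terms to hyperlink in the text, terms to list under the article)."""
    pattern = build_pattern(terms)
    found: List[str] = []
    for match in pattern.finditer(text):
        term = match.group(1).lower()
        if term not in found:
            found.append(term)
    # Assumed type-dependent rule: only diseases become "related" links.
    related = [term for term in found if terms[term] == "disease"]
    return found, related


linked, related = classify(
    "Fish oil may lower the risk of atherosclerosis. Eat more fish.", TERMS
)
print(linked)
print(related)
```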

Examples:

1) Scrapers for a medical portal

Portal url: http://www.med-life.com/
This is a medical portal with news and medical information. News and medical instructions are scraped by two separate scrapers, because they are different types of entities.

Features of the news scraper:

  • News is scraped from 2 sources (see the results on the main page: http://www.med-life.com/)
  • News is aggregated in a single database table (2 inputs stored to 1 output)
  • Because the two sources may carry similar news, there is a similarity check
  • News from both sources is scraped every hour
  • Black list: a news item on a source site sometimes has an «original site» field. There is a list of original sites whose data must not be scraped, and such news items are skipped
  • News is updated if the source data has changed (reconciliation)
  • The history of all versions of each entity is stored
  • A log of scraped data is stored
  • You can manually rescrape all or selected entities
  • List-based (for news) and alphabetical-catalog (for instructions) scrapers are supported
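The similarity check and the black list from the list above can be sketched like this. It is only an illustration: the field names title and original_site, the black-listed host and the 0.8 threshold are assumptions, and Python's difflib.SequenceMatcher stands in for whatever similarity measure a real project would use.

```python
from difflib import SequenceMatcher
from typing import Dict, List

BLACKLISTED_ORIGINS = {"blocked.example.com"}  # hypothetical black list
SIMILARITY_THRESHOLD = 0.8  # assumed cut-off, tuned per project


def is_similar(a: str, b: str) -> bool:
    """Compare two titles case-insensitively by their similarity ratio."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= SIMILARITY_THRESHOLD


def accept(item: Dict[str, str], stored: List[Dict[str, str]]) -> bool:
    """Decide whether a freshly scraped news item should be stored."""
    # Skip items whose original site is black-listed.
    if item.get("original_site") in BLACKLISTED_ORIGINS:
        return False
    # Skip items that look like news already stored from another source.
    return not any(is_similar(item["title"], old["title"]) for old in stored)


stored_news = [{"title": "Why eating sea fish is essential"}]
print(accept({"title": "WHY EATING SEA FISH IS ESSENTIAL"}, stored_news))
print(accept({"title": "Vaccination schedule announced"}, stored_news))
```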

 

Features of the instructions scraper:

You can see the instructions scraped by this scraper at http://www.med-life.com/drugalphadir
This scraper is of the alphanumeric type: it is intended for scraping alphabetical catalogs (as opposed to the paged, list-based sources handled by the news scraper) and can refresh the data one letter at a time.
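The difference between the two source types can be sketched as follows. The URL patterns (?page=N and /LETTER) and the function names are hypothetical, and the fetch function is passed in as a parameter, so the example runs without touching a real site.

```python
import string
from typing import Callable, Dict, Iterator, List

# A fetcher takes a URL and returns the entity names found on that page.
Fetcher = Callable[[str], List[str]]


def paged_urls(base_url: str, last_page: int) -> Iterator[str]:
    """List-based source: walk pages 1..last_page (assumed URL pattern)."""
    for page in range(1, last_page + 1):
        yield f"{base_url}?page={page}"


def letter_url(base_url: str, letter: str) -> str:
    """Alphabetical source: one index page per letter (assumed URL pattern)."""
    return f"{base_url}/{letter}"


def refresh_letters(
    base_url: str, fetch: Fetcher, letters: str = string.ascii_uppercase
) -> Dict[str, List[str]]:
    """Refresh only the requested letters of the catalog."""
    return {letter: fetch(letter_url(base_url, letter)) for letter in letters}


def fake_fetch(url: str) -> List[str]:
    # Stand-in for a real HTTP request and page parser.
    return [url[-1] + "-drug"]


print(list(paged_urls("http://example.com/news", 3)))
print(refresh_letters("http://example.com/catalog", fake_fetch, "AB"))
```

Refreshing by letter keeps each run small: a change under one letter does not require walking the whole catalog again.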

2) Scraper for an auto portal

Portal url: http://www.inautoclub.net/
Here you can see:

 

Terms and Cost


With this technology we can scrape up to 6 sites per day, depending on the complexity of the data and the site structure. We assume that each site has a regular structure and that data of one type is stored in the same way. Otherwise the estimate can differ.

So for each source site we can estimate the required time before we start, and you will know the cost in advance.

Have something to scrape? Want to update it periodically? Have other questions?

Contact Us