Services

HSN is composed of different services. Each of them is responsible either for managing data (e.g. writing output to external storage, parsing URLs) or for detecting malicious content. Parameters, input data and output data are not described here in full detail, and some of them have been omitted for readability. Moreover, only the most commonly used services are listed below. For a detailed description of all services, see the Data Contract (PDF) document.

Data processing

The services listed below process the input data and produce output based on it. They do not analyze the data or look for malicious activity, but they differ in how they process it, ranging from simple URL parsing to downloading a whole website and running its JavaScript.

URL Normalizer (url_norm)

The only purpose of this service is to normalize URLs to a canonical form. Since URLs can be represented in multiple ways, or even deliberately obfuscated, it only makes sense to compare the normalized forms.

Input data

Service accepts any object that has the url_original attribute.

Parameters

Service accepts one parameter - uri - which points to a file in the local file system that contains a list of links (one per line). Whitespace at the start or end of a line is ignored, as are empty lines.

Appended attributes

Service appends attributes that are the outcome of the URL parsing, like:

url_normalized

URL in its canonical form. This has the property that two URLs pointing to the same location (e.g. http://www.example.org and http://www.example.org/) have the same canonical form.

protocol

URL scheme name (e.g. mailto, http etc.)
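The idea behind normalization can be illustrated with a minimal Python sketch. It is not the code used by url_norm: the function name normalize_url and the chosen rules (lowercasing the scheme and host, dropping default ports, user info and fragments, using "/" for an empty path) are assumptions made for illustration, and the real service may apply different or additional rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports; a port equal to the default is dropped.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url_original):
    """Return a dict with url_normalized and protocol (illustrative only)."""
    parts = urlsplit(url_original.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # user info is dropped in this sketch

    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    # An empty path on a URL with a host is equivalent to "/".
    path = parts.path or ("/" if host else "")

    # The fragment is never sent to the server, so it is left out.
    url_normalized = urlunsplit((scheme, netloc, path, parts.query, ""))
    return {"url_normalized": url_normalized, "protocol": scheme}
```

With these rules, http://www.example.org and http://www.example.org/ map to the same url_normalized value, so the two forms can be compared directly.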

Output

Service does not produce any new objects.

Reporter (reporter)

The Reporter service saves data associated with the current object to the CouchDB database. The JSON document is generated using a JSON Template. This is a very minimalistic templating language; examples can be found in the /etc/hsn2/templates directory (in a standard installation).

Input data

Service accepts any object. The output is generated from this object and the JSON template, so every attribute used in the template must already be present in the object.
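The requirement above can be pictured as a simple pre-check. The Python sketch below is only an illustration: the function missing_attributes and the representation of an object as a dict are assumptions, and the list of attribute names a template needs has to be supplied by the caller.

```python
def missing_attributes(obj, required):
    """Return the attribute names from `required` that `obj` lacks, sorted."""
    return sorted(name for name in required if name not in obj)


# Hypothetical object and template requirements.
current = {"url_original": "http://www.example.org"}
print(missing_attributes(current, ["url_original", "js_classification"]))
```

An empty result means that every attribute the template refers to is available on the object.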

Parameters

The service takes two parameters:

serviceName

Name that will be appended to the document. Standard naming convention is <job_id>:<object_id>[:serviceName].

template

JSON template that will be used to create a document. This file has to be present under the /etc/hsn2/templates directory.
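The naming convention <job_id>:<object_id>[:serviceName] can be sketched in Python as follows. The helper document_name is hypothetical and only mirrors the pattern described above; it says nothing about how the Reporter builds the name internally.

```python
def document_name(job_id, object_id, service_name=None):
    """Build a name of the form <job_id>:<object_id>[:serviceName]."""
    name = f"{job_id}:{object_id}"
    if service_name:
        # The service name part is optional.
        name += f":{service_name}"
    return name
```

For example, document_name(12, 3) gives "12:3", while document_name(12, 3, "js_sta") gives "12:3:js_sta".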

Appended attributes

Service does not append any attributes. It only writes to the database.

Output

Service does not produce any new objects. It only writes to the database.

Webclient (webclient)

The HSN 2.0 web client provides functionality to visit a requested URL and download the content of a web page (e.g. HTML, JavaScript, images). From the point of view of the server it emulates the behavior of a normal web browser.

Input data

It takes any object that has a url_original or url_normalized attribute. This URL will be used for processing. The webclient also takes other, optional attributes into account when visiting a website.

Parameters

Service can be heavily parametrized. A few of these parameters are:

link_limit

Maximum number of links that the webclient will follow. This controls the number of objects created for new URLs.

processing_timeout

The maximum amount of time (in milliseconds) that the webclient can spend on all processing.

page_timeout

The maximum amount of time (in milliseconds) that the web client can wait for a resource to be downloaded.
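One way to picture how the two timeouts could interact is shown below. This Python sketch is an assumption, not documented webclient behavior: it supposes that the wait for a single resource is capped both by page_timeout and by whatever remains of processing_timeout. The function name resource_timeout_ms and the elapsed_ms argument are hypothetical.

```python
def resource_timeout_ms(page_timeout, processing_timeout, elapsed_ms):
    """Time (ms) a single download may still take under both limits."""
    # Budget left for the whole visit; never negative.
    remaining = max(processing_timeout - elapsed_ms, 0)
    # A single resource may not exceed its own limit or the remaining budget.
    return min(page_timeout, remaining)
```

With page_timeout=5000 and processing_timeout=60000, a download started after 58000 ms would be allowed 2000 ms under this assumption.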

Appended attributes

Service appends many attributes. The most important of them are:

html_source

Original HTML source as sent by the server - a complete HTTP message body, with no processing performed by the client.

js_context_list

List of JavaScript contexts executed on a website. It will be used by the JavaScript Static Analyzer.

Output

Service produces new objects, most commonly for outgoing links and embedded content. Outgoing link objects have url_original set and can be supplied to the webclient again. Details can be found in the Data Contract.
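The creation of objects for outgoing links, capped by link_limit, can be sketched in Python. This is a simplified illustration and not the webclient's implementation: it only looks at href attributes of <a> tags, represents objects as dicts, and the names LinkCollector and outgoing_link_objects are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outgoing_link_objects(base_url, html_source, link_limit):
    """Create at most link_limit objects with url_original set."""
    collector = LinkCollector()
    collector.feed(html_source)
    return [
        # Relative links are resolved against the URL of the visited page.
        {"url_original": urljoin(base_url, href)}
        for href in collector.hrefs[:link_limit]
    ]
```

Each returned dict plays the role of a new object that could be passed to the webclient again, which is how a crawl of limited size can be built from a single starting URL.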

Feeders

These services provide job input. They create an object to analyze, based on the user input.

File Feeder (feeder_list)

This feeder provides functionality to access a local or remote text file containing a list of URLs that should be processed. The text file must contain a single URL per line (either LF or CR LF terminated). For each URL the feeder creates a new object.

Input data

Service does not accept any objects (it creates one based on parameters).

Parameters

Service accepts one parameter - uri - which points to a file in the local file system that contains a list of links (one per line). Whitespace at the start or end of a line is ignored, as are empty lines.

Appended attributes

No attributes are appended to the current object.

Output

Service produces a new object for every non-empty line in the file. This object has url_original set to the (trimmed) URL from the line.
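The line handling described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the feeder's source: the function name feed_urls is hypothetical, objects are represented as dicts, and the file content is passed in as a string rather than read from the uri parameter.

```python
def feed_urls(text):
    """Create one object per non-empty line, with url_original trimmed."""
    objects = []
    # splitlines() handles both LF and CR LF line endings.
    for line in text.splitlines():
        url = line.strip()
        if url:  # empty lines are ignored
            objects.append({"url_original": url})
    return objects
```

For a file containing two URLs separated by a blank line, the sketch yields two objects, each with a trimmed url_original.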

Analyzers

Analyzers are services which look for malicious content. They are designed specifically to provide a verdict about the input data.

JavaScript static analyzer (js_sta)

The JavaScript Static Analyzer service (whose behavior is documented here) provides functionality to analyze JavaScript source code without executing it. The analysis is performed on chunks (contexts) of JavaScript code extracted by a web client.

Input data

Any object with js_context_list attribute (as created by the webclient).

Parameters

Most of the classification process performed by the analyzer cannot be parametrized from the workflow.

Appended attributes

Service appends attributes that are the outcome of the JavaScript static analysis, like:

js_classification

Classification produced by this service. This is a classification of all JavaScript contexts and can have one of the following values:

malicious

JavaScript is close to the “malicious” category in the training data set

obfuscated

JavaScript is close to the “obfuscated” category in the training data set

benign

JavaScript is considered not harmful or all suspicious contexts were whitelisted

unclassified

It was not possible to classify the code, usually because not enough n-grams were generated

js_sta_results

Result of the classification. Described in detail in Data Contract.
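The role of n-grams in the "unclassified" verdict can be illustrated with a rough Python sketch. Everything specific in it is an assumption: the crude regular-expression tokenizer, the n-gram length of 4, the threshold of 5 n-grams, and the caller-supplied classifier function standing in for the trained model, whose details are not described here.

```python
import re

# Very rough tokenizer: identifiers, numbers, or single non-space characters.
TOKEN_RE = re.compile(r"[A-Za-z_$][\w$]*|\d+|\S")


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def classify_context(source, classifier, n=4, min_ngrams=5):
    """Classify one JavaScript context without executing it."""
    grams = ngrams(TOKEN_RE.findall(source), n)
    if len(grams) < min_ngrams:
        # Too little material to compare against the training data.
        return "unclassified"
    # `classifier` is a hypothetical stand-in for the trained model.
    return classifier(grams)
```

A very short context such as "x;" produces no 4-grams at all, so under these assumptions it ends up as unclassified regardless of the classifier.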

Output

Service does not produce any new objects.