Special Webcast: SANS Dark Web Solutions Forum – Illuminating the Dark Web: Harvesting and Using OSINT Data from Dark Web Resources – November 15, 2019 8:30am US/Eastern

This post was originally published on this site

Speakers: Micah Hoffman

In the Boston area? Join us at the Live Event. Register here.

Listening to the news nowadays, we hear how the “Dark Web” is used to conduct illicit activities and how important it is to search those Dark Web resources for data that your organization or your customers care about. But in the very next sentence, many of these reports caution regular users not to visit Dark Web sites, for fear that their computer applications and systems might be attacked and possibly compromised. So, how do we get at this valuable data on the Dark Webs without visiting the sites ourselves?

This live simulcast will showcase methods of retrieving, searching, and analyzing Dark Web data without visiting those potentially malicious systems. Through customer examples, participants will leave the briefing with a good understanding of techniques and tools that can help their organizations and clients use Dark Web data.

Topics will include:

  • Data collection
  • De-anonymization techniques
  • Data aggregation and normalization
  • Dark Web user activity analysis
  • Usage trends among and within Dark Webs
  • Dark Web analyst security/OPSEC
  • Dark Web content monitoring and alerting

Earn 4 CPE Credit hours for attending this event.

Agenda: TBD

New – Insert, Update, Delete Data on S3 with Amazon EMR and Apache Hudi

Storing your data in Amazon S3 provides lots of benefits in terms of scale, reliability, and cost effectiveness. On top of that, you can leverage Amazon EMR to process and analyze your data using open source tools like Apache Spark, Hive, and Presto. As powerful as these tools are, it can still be challenging to deal with use cases where you need to do incremental data processing, and record-level insert, update, and delete.

Talking with customers, we found that there are use cases that need to handle incremental changes to individual records, for example:

  • Complying with data privacy regulations, where their users choose to exercise their right to be forgotten, or change their consent as to how their data can be used.
  • Working with streaming data, when you have to handle specific data insertion and update events.
  • Using change data capture (CDC) architectures to track and ingest database change logs from enterprise data warehouses or operational data stores.
  • Reinstating late arriving data, or analyzing data as of a specific point in time.

Starting today, EMR release 5.28.0 includes Apache Hudi (incubating), so that you no longer need to build custom solutions to perform record-level insert, update, and delete operations. Hudi development started at Uber in 2016 to address inefficiencies across ingest and ETL pipelines. In recent months the EMR team has worked closely with the Apache Hudi community, contributing patches that include updating Hudi to Spark 2.4.4 (HUDI-12), supporting Spark Avro (HUDI-91), and adding support for AWS Glue Data Catalog (HUDI-306), as well as multiple bug fixes.

Using Hudi, you can perform record-level inserts, updates, and deletes on S3, allowing you to comply with data privacy laws, consume real-time streams and change data captures, reinstate late-arriving data, and track history and rollbacks in an open, vendor-neutral format. You create datasets and tables, and Hudi manages the underlying data format. Hudi uses Apache Parquet and Apache Avro for data storage, and includes built-in integrations with Spark, Hive, and Presto, enabling you to query Hudi datasets using the same tools that you use today with near real-time access to fresh data.

When launching an EMR cluster, the libraries and tools for Hudi are installed and configured automatically any time at least one of the following components is selected: Hive, Spark, or Presto. You can use Spark to create new Hudi datasets, and insert, update, and delete data. Each Hudi dataset is registered in your cluster’s configured metastore (including the AWS Glue Data Catalog), and appears as a table that can be queried using Spark, Hive, and Presto.

Hudi supports two storage types that define how data is written, indexed, and read from S3:

  • Copy on Write – data is stored in columnar format (Parquet) and updates create a new version of the files during writes. This storage type is best used for read-heavy workloads, because the latest version of the dataset is always available in efficient columnar files.
  • Merge on Read – data is stored with a combination of columnar (Parquet) and row-based (Avro) formats; updates are logged to row-based “delta files” and compacted later creating a new version of the columnar files. This storage type is best used for write-heavy workloads, because new commits are written quickly as delta files, but reading the data set requires merging the compacted columnar files with the delta files.
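To make the trade-off between the two storage types concrete, here is a toy Python model. It is only a sketch: plain dictionaries and lists stand in for the Parquet and Avro files, the class names are made up for illustration, and nothing here uses the Hudi API. It counts how many rows each approach has to write for a single-record update.

```python
# Toy model only: a dict stands in for a columnar base file,
# a list stands in for the row-based delta files.

class CopyOnWriteTable:
    """Every update rewrites the whole base file that holds the record."""

    def __init__(self, records):
        self.base = dict(records)            # record key -> value
        self.rows_written = len(self.base)   # rows written by the initial load

    def upsert(self, key, value):
        new_base = dict(self.base)           # copy the existing file,
        new_base[key] = value                # apply the change,
        self.base = new_base                 # and publish the new version
        self.rows_written += len(new_base)

    def read(self):
        return dict(self.base)               # readers see ready-made files


class MergeOnReadTable:
    """Updates are appended to a delta log and merged when the data is read."""

    def __init__(self, records):
        self.base = dict(records)
        self.delta = []                      # (key, value) updates, oldest first
        self.rows_written = len(self.base)

    def upsert(self, key, value):
        self.delta.append((key, value))      # cheap append, no file rewrite
        self.rows_written += 1

    def read(self):
        merged = dict(self.base)
        for key, value in self.delta:        # the reader pays the merge cost
            merged[key] = value
        return merged


# Hypothetical sample data: 1000 records keyed by a made-up IP address.
records = {"10.0.0.%d" % i: "elb_demo_003" for i in range(1000)}
cow = CopyOnWriteTable(records)
mor = MergeOnReadTable(records)
cow.upsert("10.0.0.7", "elb_demo_001")
mor.upsert("10.0.0.7", "elb_demo_001")
print("rows written, copy on write:", cow.rows_written)
print("rows written, merge on read:", mor.rows_written)
```

Both toy tables return the same data when read; they differ in whether the cost of an update is paid at write time or at read time.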

Let’s do a quick overview of how you can set up and use Hudi datasets in an EMR cluster.

Using Apache Hudi with Amazon EMR
I start creating a cluster from the EMR console. In the advanced options I select EMR release 5.28.0 (the first including Hudi) and the following applications: Spark, Hive, and Tez. In the hardware options, I add 3 task nodes to ensure I have enough capacity to run both Spark and Hive.

When the cluster is ready, I use the key pair I selected in the security options to SSH into the master node and access the Spark Shell. I use the following command to start the Spark Shell to use it with Hudi:

$ spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
              --conf "spark.sql.hive.convertMetastoreParquet=false" \
              --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar

There, I use the following Scala code to import some sample ELB logs in a Hudi dataset using the Copy on Write storage type:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor

//Set up various input values as variables
val inputDataPath = "s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/"
val hudiTableName = "elb_logs_hudi_cow"
val hudiTablePath = "s3://MY-BUCKET/PATH/" + hudiTableName

// Set up our Hudi Data Source Options
val hudiOptions = Map[String,String](
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "request_ip",
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "request_verb", 
    HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
    DataSourceWriteOptions.OPERATION_OPT_KEY ->
        DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "request_timestamp", 
    DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
    DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> hudiTableName, 
    DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "request_verb", 
    DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY -> "false", 
    DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
        classOf[MultiPartKeysValueExtractor].getName)

// Read data from S3 and create a DataFrame with Partition and Record Key
val inputDF = spark.read.format("parquet").load(inputDataPath)

// Write data into the Hudi dataset
inputDF.write
       .format("org.apache.hudi")
       .options(hudiOptions)
       .mode(SaveMode.Overwrite)
       .save(hudiTablePath)

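In the Spark Shell, I can now read the Hudi dataset back and count its records:

// inputDF2 is assumed to be the Hudi dataset read back from S3 (one partition level, hence "/*/*")
scala> val inputDF2 = spark.read.format("org.apache.hudi").load(hudiTablePath + "/*/*")
scala> inputDF2.count()
res1: Long = 10491958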

In the options, I used the integration with the Hive metastore configured for the cluster, so that the table is created in the default database. In this way, I can use Hive to query the data in the Hudi dataset:

hive> use default;
hive> select count(*) from elb_logs_hudi_cow;
...
OK
10491958
...

I can now update or delete a single record in the dataset. In the Spark Shell, I prepare some variables to find the record I want to update, and a SQL statement to select the value of the column I want to change:

val requestIpToUpdate = "243.80.62.181"
val sqlStatement = s"SELECT elb_name FROM elb_logs_hudi_cow WHERE request_ip = '$requestIpToUpdate'"

I execute the SQL statement to see the current value of the column:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_003|
+------------+

Then, I select and update the record:

// Create a DataFrame with a single record and update column value
val updateDF = inputDF.filter(col("request_ip") === requestIpToUpdate)
                      .withColumn("elb_name", lit("elb_demo_001"))

Now I update the Hudi dataset with a syntax similar to the one I used to create it. But this time, the DataFrame I am writing contains only one record:

// Write the DataFrame as an update to existing Hudi dataset
updateDF.write
        .format("org.apache.hudi")
        .options(hudiOptions)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
                DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .mode(SaveMode.Append)
        .save(hudiTablePath)

In the Spark Shell, I check the result of the update:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_001|
+------------+

Now I want to delete the same record. To delete it, I pass the EmptyHoodieRecordPayload payload in the write options:

// Write the DataFrame with an EmptyHoodieRecordPayload for deleting a record
updateDF.write
        .format("org.apache.hudi")
        .options(hudiOptions)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
                DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
                "org.apache.hudi.EmptyHoodieRecordPayload")
        .mode(SaveMode.Append)
        .save(hudiTablePath)

In the Spark Shell, I see that the record is no longer available:

scala> spark.sql(sqlStatement).show()
+--------+                                                                      
|elb_name|
+--------+
+--------+

How are all those updates and deletes managed by Hudi? Let’s use the Hudi Command Line Interface (CLI) to connect to the dataset and see how those changes are interpreted as commits:

This dataset is a Copy on Write dataset, which means that each time there is an update to a record, the file that contains that record is rewritten to contain the updated values. You can see how many records have been written for each commit. The bottom line of the table describes the initial creation of the dataset; above it is the single-record update, and at the top the single-record delete.

With Hudi, you can roll back to each commit. For example, I can roll back the delete operation with:

hudi:elb_logs_hudi_cow->commit rollback --commit 20191104121031

In the Spark Shell, the record is now back to where it was, just after the update:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_001|
+------------+

Copy on Write is the default storage type. I can repeat the steps above to create and update a Merge on Read dataset type by adding this to our hudiOptions:

DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "MERGE_ON_READ"

If you update a Merge on Read dataset and look at the commits with the Hudi CLI, you can see how different Merge on Read is compared to Copy on Write. With Merge on Read, you are only writing the updated rows and not whole files as with Copy on Write. This is why Merge on Read is helpful for use cases that require more writes, or update/delete-heavy workloads, with fewer reads. Delta commits are written to disk as Avro records (row-based storage), and compacted data is written as Parquet files (columnar storage). To avoid creating too many delta files, Hudi will automatically compact your dataset so that your reads are as performant as possible.

When a Merge On Read dataset is created, two Hive tables are created:

  • The first table matches the name of the dataset.
  • The second table has the characters _rt appended to its name; the _rt postfix stands for real-time.

When queried, the first table returns the data that has been compacted, and does not show the latest delta commits. Using this table provides the best performance, but omits the freshest data. Querying the real-time table merges the compacted data with the delta commits on read, hence this dataset being called “Merge on Read”. This makes the freshest data available, but incurs a performance overhead compared with querying the compacted data. In this way, data engineers and analysts have the flexibility to choose between performance and data freshness.
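The difference between the two tables can be sketched with a small Python example. This is a conceptual illustration under simplifying assumptions, not Hudi code: the function names, the sample keys, and the use of None to mark a delete are all hypothetical.

```python
def read_optimized_view(base, delta_log):
    """First table: compacted data only, pending delta commits are ignored."""
    return dict(base)


def real_time_view(base, delta_log):
    """The _rt table: compacted data merged with delta commits at query time."""
    merged = dict(base)
    for key, value in delta_log:
        if value is None:
            merged.pop(key, None)   # None marks a delete in this toy model
        else:
            merged[key] = value
    return merged


# Hypothetical state: two compacted records, then two delta commits
# that have not been compacted yet.
base = {"ip-a": "elb_demo_003", "ip-b": "elb_demo_002"}
delta_log = [("ip-a", "elb_demo_001"), ("ip-b", None)]

print(read_optimized_view(base, delta_log))  # fast, but stale
print(real_time_view(base, delta_log))       # fresh, merged on read
```

After a compaction, the merged result would become the new base and the delta log would be empty, so both views would agree again.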

Available Now
This new feature is available now in all regions with EMR 5.28.0. There is no additional cost in using Hudi with EMR. You can learn more about Hudi in the EMR documentation. This new tool can simplify the way you process, update, and delete data in S3. Let me know which use cases you are going to use it for!

Danilo

What’s New in the vRealize Log Insight Content Pack for vSAN

VMware vRealize Log Insight is masterful in its ability to take large amounts of unstructured log data that is all too often ignored, and give it practical meaning for a data center administrator. vSAN customers benefit from this too, with a large assortment of event log dashboards that are purpose-built for vSAN. These dashboards come

The post What’s New in the vRealize Log Insight Content Pack for vSAN appeared first on Virtual Blocks.

VMware Customers back formation of a VMware User Group (VMUG) for Telco Cloud

At a special invitation-only user forum held on November 5th alongside VMworld Europe 2019, over 20 representatives from 10 Communication Service Provider (CSP) customers began the process of building the first VMware User Group (VMUG) community specifically focused on supporting telecom infrastructure projects for NFV and Telco Cloud. VMUG, an independent and customer-led organisation, helps […]

The post VMware Customers back formation of a VMware User Group (VMUG) for Telco Cloud appeared first on VMware Telco NFV Blog.

VMware Offers Free VCP Exam Vouchers Across the Globe

VMware extended and added new offers this month to help you deepen your skills, get certified, and have the opportunity to train more of your team! Find out what’s available in your region and book an eligible class today. Europe, Middle East & Africa If you’re planning on deploying VMware NSX-T Data Center, then you… Read More »

Improving NIC and switch performance for vSAN (and other IP storage)

This is going to be a short post collecting a few tricks to unlock some bottlenecks in storage networking that may grow over time: Unfortunately, a lot of troubleshooting of networking performance stops earlier than it should. Two common incomplete troubleshooting workflows I’ve seen: Someone checks that network utilization on a host isn’t near the […]

The post Improving NIC and switch performance for vSAN (and other IP storage) appeared first on Virtual Ramblings.

New Automation Features In AWS Systems Manager

Today we are announcing additional automation features inside of AWS Systems Manager. If you haven’t used Systems Manager yet, it’s a service that provides a unified user interface so you can view operational data from multiple AWS services and allows you to automate operational tasks across your AWS resources.

With this new release, it just got even more powerful. We have added capabilities to AWS Systems Manager that enable you to build, run, and share automations with others on your team or inside your organisation, making the management of your infrastructure more repeatable and less error-prone.

Inside the AWS Systems Manager console, the navigation menu has an item called Automation. If I click this menu item, I will see the Execute automation button.

When I click on this I am asked what document I want to run. AWS provides a library of documents that I could choose from, however today, I am going to build my own so I will click on the Create document button.

This takes me to a new screen that allows me to create a document (sometimes referred to as an automation playbook) that, amongst other things, executes Python or PowerShell scripts.

The console gives me two options for editing a document: A YAML editor or the “Builder” tool that provides a guided, step-by-step user interface with the ability to include documentation for each workflow step.

So, let’s take a look by building and running a simple automation. When I create a document using the Builder tool, the first thing required is a document name.

Next, I need to provide a description. As you can see below, I’m able to use Markdown to format the description. The description is an excellent opportunity to describe what your document does. This is valuable since most users will want to share these documents with others on their team and build a library of documents to solve everyday problems.

Optionally, I am asked to provide parameters for my document. These parameters can be used in all of the scripts that you will create later. In my example, I have created three parameters: imageId, tagValue, and instanceType. When I come to execute this document, I will have the opportunity to provide values for these parameters that will override any defaults that I set.

When someone executes my document, the scripts that are executed will interact with AWS services. A document runs with the user permissions for most of its actions along with the option of providing an Assume Role. However, for documents with the Run a Script action, the role is required when the script is calling any AWS API.

You can set the Assume role globally in the builder tool; however, I like to add a parameter called assumeRole to my document, which gives anyone who executes it the ability to provide a different one.

You then wire this parameter up to the global assumeRole by using the {{assumeRole}} syntax in the Assume role property textbox. I have called my parameter assumeRole, but you could call it what you like; just make sure that the name you give the parameter is what you put inside the double curly braces, e.g. {{yourParamName}}.

Once my document is set up, I need to create its first step. A document can contain one or more steps, and you can create sophisticated workflows with branching, for example based on a parameter or the failure of a step. In this example, though, I am going to create three steps that execute one after another. Again, you need to give the step a name and a description, which can also include Markdown. You also need to select an Action Type; for this example I will choose Run a script.

With the ‘Run a script’ action type, I get to run a script in Python or PowerShell without requiring any infrastructure to run the script. It’s important to realise that this script will not be running on one of your EC2 instances. The scripts run in a managed compute environment. On the preferences page, you can configure an Amazon CloudWatch log group of your choice to receive the outputs.

In this demo, I write some Python that creates an EC2 instance. You will notice that this script is using the AWS SDK for Python. I create an instance based upon an image_id, tag_value, and instance_type that are passed in as parameters to the script.
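The script itself is not reproduced in this post, so what follows is only a minimal sketch of what such a step could look like, not the author's exact code. It assumes the handler receives image_id, tag_value, and instance_type in its events argument; the function names, the tag key "Name", and the shape of the return value are assumptions made for illustration.

```python
def build_run_instances_args(events):
    """Translate the step's input payload into keyword arguments for run_instances."""
    return {
        "ImageId": events["image_id"],
        "InstanceType": events["instance_type"],
        "MinCount": 1,
        "MaxCount": 1,
        "TagSpecifications": [{
            "ResourceType": "instance",
            # The tag key "Name" is an assumption; only the tag value is a parameter.
            "Tags": [{"Key": "Name", "Value": events["tag_value"]}],
        }],
    }


def launch_instance(events, context):
    # boto3 is imported inside the handler so the module loads without it
    import boto3

    ec2 = boto3.client("ec2")
    res = ec2.run_instances(**build_run_instances_args(events))
    instance_id = res["Instances"][0]["InstanceId"]
    print("[INFO] Launched instance", instance_id)

    # Returned so that a later step can pick up the instance ID
    return {"InstanceId": instance_id}
```

Keeping the argument building in its own function makes that part easy to check locally without calling AWS; only launch_instance needs credentials and the role discussed above.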

To pass parameters into the script, in the Additional Inputs section, I select InputPayload as the input type. I then use a particular YAML format in the Input Value text box to wire up the global parameters to the parameters that I am going to use in the script. You will notice that I have again used the double curly braces syntax to reference the global parameters, e.g. {{imageId}}.

In the Outputs section, I also wire up an output parameter that can be used by subsequent steps.

Next, I will add a second step to my document. This time I will poll the instance to see if its status has switched to ok. The exciting thing about this code is that the InstanceId is passed into the script from a previous step. This is an example of how the execution steps can be chained together to use outputs of earlier steps.

def poll_instance(events, context):
    import boto3
    import time

    ec2 = boto3.client('ec2')

    # InstanceId is supplied by the output of the previous step
    instance_id = events['InstanceId']

    print('[INFO] Waiting for instance to enter Status: Ok', instance_id)

    instance_status = "null"

    while True:
        res = ec2.describe_instance_status(InstanceIds=[instance_id])

        # A newly launched instance may not report a status yet
        if len(res['InstanceStatuses']) == 0:
            print("Instance Status Info is not available yet")
            time.sleep(5)
            continue

        instance_status = res['InstanceStatuses'][0]['InstanceStatus']['Status']

        print('[INFO] Polling get status of the instance', instance_status)

        # Stop polling once the instance status is ok
        if instance_status == 'ok':
            break

        time.sleep(10)

    return {'Status': instance_status, 'InstanceId': instance_id}

To pass the parameters into the second step, notice that I use the double curly braces syntax to reference the output of a previous step. The value in the Input value textbox, {{launchEc2Instance.payload}}, is the name of the step (launchEc2Instance) followed by the name of the output parameter (payload).
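As a way of picturing how these references behave, the following Python sketch resolves {{name}} and {{step.output}} placeholders against two dictionaries. It is purely illustrative and says nothing about how Systems Manager implements the substitution; the resolve function, the regular expression, and the sample values are all hypothetical.

```python
import re

# Matches {{name}} or {{step.output}}, allowing optional spaces inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z0-9_.]+)\s*\}\}")


def resolve(template, parameters, step_outputs):
    """Replace each placeholder with a document parameter or an earlier step's output."""
    def lookup(match):
        name = match.group(1)
        if "." in name:
            # "step.output" refers to a named output of a previous step
            step, output = name.split(".", 1)
            return str(step_outputs[step][output])
        # A bare name refers to a document-level parameter
        return str(parameters[name])

    return PLACEHOLDER.sub(lookup, template)


# Hypothetical values, for illustration only
parameters = {"imageId": "ami-0123456789abcdef0"}
step_outputs = {"launchEc2Instance": {"payload": "i-0abc1234def567890"}}

print(resolve("{{imageId}}", parameters, step_outputs))
print(resolve("{{launchEc2Instance.payload}}", parameters, step_outputs))
```

The point of the sketch is the naming rule: a bare name looks up a document parameter, while a dotted name looks up an output of an earlier step.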

Lastly, I will add a final step. This step will run a PowerShell script and use the AWS Tools for PowerShell. I’ve added this step purely to show that you can use PowerShell as an alternative to Python.

You will note on the first line that I have to install the AWSPowerShell.NetCore module, using the -Force switch, before I can start interacting with AWS services.

All this step does is take the InstanceId output from the launchEc2Instance step and use it to return the InstanceType of the EC2 instance.

It’s important to note that I have to pass the parameters from the launchEc2Instance step to this step by configuring the Additional inputs in the same way I did earlier.

Now that our document is created, we can execute it. I go to the Actions & Change section of the menu and select Automation. From this screen, I click on the Execute automation button. I then get to choose the document I want to execute. Since this is a document I created, I can find it on the Owned by me tab.

If I click the LaunchInstance document that I created earlier, I get a document details screen that shows me the description I added. This nicely formatted description allows me to generate documentation for my document and enable others to understand what it is trying to achieve.

When I click Next, I am asked to provide any Input parameters for my document. I add the imageId and ARN for the role that I want to use when executing this automation. It’s important to remember that this role will need to have permissions to call any of the services that are requested by the scripts. In my example, that means it needs to be able to create EC2 instances.

Once the document executes, I am taken to a screen that shows the steps of the document and gives me details about how long each step took and respective success or failure of each step. I can also drill down into each step and examine the logs. As you can see, all three steps of my document completed successfully, and if I go to the Amazon Elastic Compute Cloud (EC2) console, I will now have an EC2 instance that I created with tag LaunchedBySsmAutomation.

These new features can be found today in all regions inside the AWS Systems Manager console so you can start using them straight away.

Happy Automating!

— Martin;

Some packet-fu with Zeek (previously known as bro), (Mon, Nov 11th)

During an incident response process, one of the fundamental variables to consider is speed. If a network capture is being made in which we can presumably find evidence of who is causing an incident and how, every second counts in order to get ahead of the attacker in the cyber kill chain sequence.

We need to use a passive approach in the analysis of network traffic to be quick in obtaining results. Zeek is a powerful tool to use in these scenarios. It is a tool with network traffic processing capabilities for application level protocols (DCE-RPC, DHCP, DNP3, DNS, FTP, HTTP, IMAP, IRC, KRB, MODBUS, MQTT, MYSQL, NTLM, NTP, POP3, RADIUS, RDP, RFB, SIP, SMB, SMTP, SOCKS, SSH, SSL, SYSLOG, TUNNELS, XMPP), pattern search and a powerful scripting language to process what the incident responder might require.

Zeek scripts work through events. We can find a summary of all possible events that can be used at https://docs.zeek.org/en/stable/scripts/base/bif/event.bif.zeek.html. Next we will review those that will be covered by the examples of this diary:

  • new_connection: This event is raised every time a new connection is detected.
  • zeek_done: This event is raised when the packet input is exhausted.
  • protocol_confirmation: This event is raised when Zeek was able to confirm the protocol inside a specific connection.

We will cover three simple use cases in this diary:

  • Top talkers by source IP connection and new connections performed.
  • Top talkers by source IP and destination port, with new connections performed.
  • Number of connections confirmed by zeek for a specific IP address with a specific protocol.

Top talkers by source IP connection

The following script implements the use case:

# Number of new connections seen per source IP address
global attempts: table[addr] of count &default=0;

event new_connection (c: connection)
{
    local source = c$id$orig_h;
    local n = ++attempts[source];
}

event zeek_done ()
{
    # Write "count address" lines once the packet input is exhausted
    local toplog=open("toptalkers.log");
    for (k in attempts)
        print toplog,fmt("%s %s",attempts[k],k);
    close(toplog);
}

Let’s go through the script in detail:

  • We store the result in the attempts table, which is indexed by IP address (type addr) and holds the number of occurrences (type count).
  • Using the new_connection event, we traverse the capture, counting the source IP addresses that generate new connections.
  • Once the packet input is exhausted, the zeek_done event creates the toptalkers.log file and writes each entry of the attempts table to it, with the fields separated by spaces.

Let’s see a snippet of the output:

We can get a sorted output:
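One way to produce such a sorted view is sketched below in Python. It assumes the "count address" line format written by the script above; the function name top_talkers and the sample lines (which use documentation-range addresses, not real capture data) are hypothetical.

```python
def top_talkers(lines, limit=10):
    """Parse 'count address' lines and return (count, address) pairs, busiest first."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        rows.append((int(parts[0]), parts[1]))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


# Hypothetical sample lines in the same format as toptalkers.log
sample = ["3 192.0.2.10", "17 198.51.100.7", "", "5 203.0.113.4"]
for count, address in top_talkers(sample):
    print(count, address)
```

To run it against the real file, pass an open file handle instead of the sample list, for example top_talkers(open("toptalkers.log")).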

Top talkers by source IP and destination port, with new connections performed

The following script implements the use case:

# Number of new connections seen per (source IP address, destination port) pair
global attempts: table[addr,port] of count &default=0;

event new_connection (c: connection)
{
    local source = c$id$orig_h;
    local the_port = c$id$resp_p;
    local n = ++attempts[source,the_port];
}

event zeek_done ()
{
    # Write "count address port" lines once the packet input is exhausted
    local toplog=open("toptalkers.log");
    for ([k,l] in attempts)
        print toplog,fmt("%s %s %s",attempts[k,l],k,l);
    close(toplog);
}

Let’s review the differences from the previous one:

  • The table now includes ports, so a new type appears in its declaration: port.
  • Within the new_connection event, the code stores the source IP, the destination port, and a counter for each occurrence of this pair in the network capture.
  • Once packet processing is finished, the information is written to the file toptalkers.log.

Let’s see a snippet of the script’s output:

We can get a sorted output:

Number of connections confirmed by zeek for a specific IP address with a specific protocol

The following script implements the use case:

# Number of confirmed connections per (source IP address, protocol analyzer) pair
global attempts: table[addr,Analyzer::Tag] of count &default=0;

event protocol_confirmation (c: connection, the_type: Analyzer::Tag, aid:count)
{
    local source = c$id$orig_h;
    local n = ++attempts[source,the_type];
}

event zeek_done ()
{
    # Write "address,protocol,count" lines once the packet input is exhausted
    local toplog=open("toptalkers.log");
    for ([k,l] in attempts)
        print toplog,fmt("%s,%s,%s",k,l,attempts[k,l]);
    close(toplog);
}

We can see some new aspects:

  • The Analyzer::Tag type: when the protocol_confirmation event is raised, the argument of this type holds the protocol that Zeek confirmed in the connection.
  • Information is stored in the table and then saved to a file within the zeek_done event.

Let’s see a snippet of the script’s output:
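To pull the figure for one specific IP address and protocol out of that file, a small Python helper along these lines could be used. It is a sketch that assumes the "address,protocol,count" line format written by the script above; the function name and the sample lines, including the exact spelling of the protocol tags, are assumptions for illustration.

```python
import csv


def confirmed_connections(lines, address, protocol_tag):
    """Sum the counts recorded for one source address and one protocol tag."""
    total = 0
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        source, tag, count = row
        if source == address and tag == protocol_tag:
            total += int(count)
    return total


# Hypothetical sample lines; the tag spelling in a real log may differ
sample = [
    "192.0.2.10,Analyzer::ANALYZER_HTTP,12",
    "192.0.2.10,Analyzer::ANALYZER_DNS,40",
    "198.51.100.7,Analyzer::ANALYZER_HTTP,3",
]
print(confirmed_connections(sample, "192.0.2.10", "Analyzer::ANALYZER_HTTP"))
```

As with any file-based input, an open handle on toptalkers.log can be passed in place of the sample list.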

In my next diaries I will cover other interesting use cases with zeek using the frameworks that it has.

Manuel Humberto Santander Peláez
SANS Internet Storm Center – Handler
Twitter: @manuelsantander
Web:http://manuel.santander.name
e-mail: msantand at isc dot sans dot org

(c) SANS Internet Storm Center. https://isc.sans.edu Creative Commons Attribution-Noncommercial 3.0 United States License.