<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
><channel><title>Bommarito Consulting</title> <atom:link href="http://michaelbommarito.com/feed/" rel="self" type="application/rss+xml" /><link>http://michaelbommarito.com</link> <description>Cloud infrastructure, software development, and big data solutions.</description> <lastBuildDate>Mon, 14 May 2012 17:56:24 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" /> <item><title>Down with the static</title><link>http://michaelbommarito.com/2012/05/06/down-with-the-static/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=down-with-the-static</link> <comments>http://michaelbommarito.com/2012/05/06/down-with-the-static/#comments</comments> <pubDate>Sun, 06 May 2012 11:58:06 +0000</pubDate> <dc:creator>Michael J Bommarito II</dc:creator> <category><![CDATA[Company News]]></category><guid
isPermaLink="false">http://ec2-23-22-85-135.compute-1.amazonaws.com/?p=21</guid> <description><![CDATA[After six years of static HTML, it finally became apparent that this site needed a real CMS.  Please excuse the construction while we settle in to our new home.]]></description> <content:encoded><![CDATA[<p>After six years of static HTML, it finally became apparent that this site needed a real CMS.  Please excuse the construction while we settle in to our new home.</p> ]]></content:encoded> <wfw:commentRss>http://michaelbommarito.com/2012/05/06/down-with-the-static/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Generating AWS CloudSearch SDF for Emails</title><link>http://michaelbommarito.com/2012/04/21/generating-aws-cloudsearch-sdf-for-emails/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=generating-aws-cloudsearch-sdf-for-emails</link> <comments>http://michaelbommarito.com/2012/04/21/generating-aws-cloudsearch-sdf-for-emails/#comments</comments> <pubDate>Sat, 21 Apr 2012 13:05:57 +0000</pubDate> <dc:creator>Michael J Bommarito II</dc:creator> <category><![CDATA[Programming]]></category> <category><![CDATA[Research]]></category> <category><![CDATA[aws]]></category> <category><![CDATA[cloud]]></category> <category><![CDATA[eDiscovery]]></category> <category><![CDATA[finance]]></category> <category><![CDATA[legal informatics]]></category> <category><![CDATA[programming]]></category> <category><![CDATA[python]]></category><guid
isPermaLink="false">http://www.michaelbommarito.com/blog/?p=661</guid> <description><![CDATA[  In my last post on CloudSearch and eDiscovery, I described something like &#8220;Google&#8221; for eDiscovery emails.  FedEx or DropBox your data to an eDiscovery service provider like myself, and rest assured that you&#8217;ll soon have a powerful, web-based user<span
class="ellipsis">&#8230;</span> <a
href="http://michaelbommarito.com/2012/04/21/generating-aws-cloudsearch-sdf-for-emails/"><div
class="read-more">Read more &#8250;</div></a>]]></description> <content:encoded><![CDATA[<p
style="text-align: justify;">  In <a
title="“Google” for subpoenaed emails: AWS CloudSearch for eDiscovery" href="http://www.michaelbommarito.com/blog/2012/04/21/google-for-subpoenaed-emails-aws-cloudsearch-for-ediscovery/" target="_blank">my last post on CloudSearch and eDiscovery</a>, I described something like &#8220;Google&#8221; for eDiscovery emails.  FedEx or DropBox your data to <a
title="eDiscovery Service Provider" href="http://www.michaelbommarito.com/" target="_blank">an eDiscovery service provider like myself</a>, and rest assured that you&#8217;ll soon have a powerful, web-based user interface for searching and visualizing your digital discovery materials.</p><p
style="text-align: justify;">  As a technical follow-up to this post, I thought I&#8217;d share a proof-of-concept email parser based on <a
title="Enron email dataset" href="http://www.cs.cmu.edu/~enron/" target="_blank">the Enron email dataset</a>.  The Python script below takes a directory of RFC822 email messages and returns an AWS CloudSearch JSON SDF with fields drawn from the Date, From, To, and Subject headers and the body of each email.  There is no special handling for attachments or encoding in this example, but it can be used to populate a CloudSearch domain from the Enron emails.  Sample usage is below; see also <a
href="https://s3.amazonaws.com/michaelbommarito.com/data/allen-p-inbox.json" target="_blank">the output sample here</a>.</p><pre>$ python src/generateSDF.py "data/maildir/allen-p/inbox/*" | curl -X POST -d "@-" --header "Content-Type: application/json" doc-domain_name-domain_id.us-east-1.cloudsearch.amazonaws.com/2011-02-01/documents/batch</pre><p>Source code below the break.</p><p><span
id="more-661"></span><br
/><script src="https://gist.github.com/2436913.js?file=generateSDF_RFC822.py"></script></p> ]]></content:encoded> <wfw:commentRss>http://michaelbommarito.com/2012/04/21/generating-aws-cloudsearch-sdf-for-emails/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>&#8220;Google&#8221; for subpoenaed emails: AWS CloudSearch for eDiscovery</title><link>http://michaelbommarito.com/2012/04/21/google-for-subpoenaed-emails-aws-cloudsearch-for-ediscovery/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=google-for-subpoenaed-emails-aws-cloudsearch-for-ediscovery</link> <comments>http://michaelbommarito.com/2012/04/21/google-for-subpoenaed-emails-aws-cloudsearch-for-ediscovery/#comments</comments> <pubDate>Sat, 21 Apr 2012 12:49:16 +0000</pubDate> <dc:creator>Michael J Bommarito II</dc:creator> <category><![CDATA[Law]]></category> <category><![CDATA[Programming]]></category> <category><![CDATA[Technology]]></category> <category><![CDATA[aws]]></category> <category><![CDATA[cloud]]></category> <category><![CDATA[computational legal studies]]></category> <category><![CDATA[data]]></category> <category><![CDATA[eDiscovery]]></category> <category><![CDATA[law]]></category> <category><![CDATA[legal informatics]]></category> <category><![CDATA[visualization]]></category><guid
isPermaLink="false">http://www.michaelbommarito.com/blog/?p=655</guid> <description><![CDATA[  In the last post on AWS CloudSearch, I provided a tutorial on the creation of a simple CloudSearch domain for Supreme Court decisions.  This walkthrough described the steps of creating a domain, configuring access policies and indexing, populating the index,<span
class="ellipsis">&#8230;</span> <a
href="http://michaelbommarito.com/2012/04/21/google-for-subpoenaed-emails-aws-cloudsearch-for-ediscovery/"><div
class="read-more">Read more &#8250;</div></a>]]></description> <content:encoded><![CDATA[<p
style="text-align: justify;">  In the last post on <a
title="AWS CloudSearch" href="http://aws.amazon.com/cloudsearch/" target="_blank">AWS CloudSearch</a>, I provided a tutorial on <a
title="Building an AWS CloudSearch domain for the Supreme Court" href="http://www.michaelbommarito.com/blog/2012/04/15/building-an-aws-cloudsearch-domain-for-the-supreme-court/" target="_blank">the creation of a simple CloudSearch domain for Supreme Court decisions</a>.  This walkthrough described the steps of creating a domain, configuring access policies and indexing, populating the index, and using the search API.  We were left with a functioning case search database.</p><p
style="text-align: justify;">  From a technical perspective, one key difference between this example and many real-world applications is that we let the CloudSearch tools automatically decide what fields and content were available to search.  While this worked well in the previous example, I want to provide a concrete example of a context in which custom services and development are required.</p><p
style="text-align: justify;">  Imagine you&#8217;re a small law firm that specializes in HR disputes.  As part of a time-sensitive non-solicitation claim filed by your client, you&#8217;ve subpoenaed email from fifteen employees at your client&#8217;s competitor.  It&#8217;s Friday afternoon at 5 PM, and you finally receive a hard drive with the emails.  However, in an effort to overwhelm your small team, the other party has dumped 10GB of data on your plate.  There&#8217;s no way you can search through this by hand.  You have a hearing on Wednesday, but need to prepare a strategy memo for your client by Monday morning.  Do you disappoint your client and move to reschedule?  How could you possibly make the deadline?  If only you could just press a button and get something like Google for your data&#8230;</p><p
style="text-align: justify;">  Combined with the right service provider (like <a
title="Bommarito Consulting" href="http://michaelbommarito.com/consulting.html" target="_blank">Bommarito Consulting</a>!), AWS CloudSearch is a perfect solution for this problem.  Before CloudSearch, the provision of eDiscovery services was constrained by whatever on-site infrastructure was available.  eDiscovery service providers had to make large capital expenditures on servers and storage to meet peak customer needs, inflating the price paid by other customers.  Even when eDiscovery service providers used Infrastructure-as-a-Service (IaaS) offerings like AWS EC2, managing variable customer demand still required a significant amount of operations overhead.</p><p
style="text-align: justify;">  CloudSearch makes these problems disappear.  In our example above, building a &#8220;Google&#8221; for your subpoenaed emails can be done in just hours.  The core components are an RFC822 parser to populate the search domain and a front-end user interface for searching and visualizing the results.  If this service sounds valuable to your business, today or just prospectively, please feel free to <a
title="Bommarito Consulting Contact" href="http://michaelbommarito.com/" target="_blank">call or email</a> regarding a demo or additional information.</p> ]]></content:encoded> <wfw:commentRss>http://michaelbommarito.com/2012/04/21/google-for-subpoenaed-emails-aws-cloudsearch-for-ediscovery/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Building an AWS CloudSearch domain for the Supreme Court</title><link>http://michaelbommarito.com/2012/04/15/building-an-aws-cloudsearch-domain-for-the-supreme-court/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=building-an-aws-cloudsearch-domain-for-the-supreme-court</link> <comments>http://michaelbommarito.com/2012/04/15/building-an-aws-cloudsearch-domain-for-the-supreme-court/#comments</comments> <pubDate>Sun, 15 Apr 2012 14:29:33 +0000</pubDate> <dc:creator>Michael J Bommarito II</dc:creator> <category><![CDATA[Law]]></category> <category><![CDATA[Programming]]></category> <category><![CDATA[Research]]></category> <category><![CDATA[aws]]></category> <category><![CDATA[cloud]]></category> <category><![CDATA[cloudsearch]]></category> <category><![CDATA[language]]></category> <category><![CDATA[law]]></category> <category><![CDATA[legal informatics]]></category><guid
isPermaLink="false">http://www.michaelbommarito.com/blog/?p=641</guid> <description><![CDATA[  It should be pretty clear by now that two things I&#8217;m very interested in are cloud computing and legal informatics.  What better way to show it than to put together a simple AWS CloudSearch tutorial using Supreme Court decisions<span
class="ellipsis">&#8230;</span> <a
href="http://michaelbommarito.com/2012/04/15/building-an-aws-cloudsearch-domain-for-the-supreme-court/"><div
class="read-more">Read more &#8250;</div></a>]]></description> <content:encoded><![CDATA[<p
style="text-align: justify;">  It should be pretty clear by now that two things I&#8217;m very interested in are <a
href="http://www.michaelbommarito.com/blog/tag/cloud/" target="_blank">cloud computing</a> and <a
href="http://www.michaelbommarito.com/blog/tag/legal-informatics/" target="_blank">legal informatics</a>.  What better way to show it than to put together a simple <a
title="Cloud Search" href="http://aws.amazon.com/cloudsearch/" target="_blank">AWS CloudSearch</a> tutorial using Supreme Court decisions as the context?  The steps below should take you through creating a fully functional search domain on AWS CloudSearch for Supreme Court decisions.</p><h2 style="text-align: justify;">Acquiring Supreme Court decision data</h2><p
style="text-align: justify;">  Our first step is to acquire a public domain copy of Supreme Court decisions from <a
href="https://twitter.com/carlmalamud">Carl Malamud</a>&#8216;s <a
href="https://public.resource.org/">resource.org</a>.  You can navigate to <a
href="http://bulk.resource.org/courts.gov/c/" target="_blank">this directory</a> and download US.tar.bz2, or just run something like:</p><pre>$ wget http://bulk.resource.org/courts.gov/c/US.tar.bz2</pre><p>Once the download is done, extract the archive:</p><pre>$ tar xjf US.tar.bz2</pre><p
style="text-align: justify;">  We should now have a directory called US containing 62,839 files totaling 1.1GB.  Let&#8217;s assume that you put this directory at something like <strong>/data/courts/US</strong>.</p><h2 style="text-align: justify;">Setting up Cloud Search command line tools</h2><p
style="text-align: justify;">  The next step is easy &#8211; go follow <a
title="Installing AWS Cloud Search Command Line Tools" href="http://www.michaelbommarito.com/blog/2012/04/14/installing-aws-cloud-search-command-line-tools/" target="_blank">my guide on setting up Cloud Search command line tools</a>!  I&#8217;ll assume that you placed everything under <strong>/opt/aws/cloud-search-tools</strong>, just like in that post.</p><h2 style="text-align: justify;">Creating a Cloud Search Domain</h2><p
style="text-align: justify;">  OK, we should now have a dataset and the Cloud Search API at our fingertips.  It&#8217;s time to create a Cloud Search &#8220;domain&#8221; that we can populate with records.  To do so, you can either follow the instructions on your AWS Management Console or run the following:</p><pre>$ /opt/aws/cloud-search-tools/bin/cs-create-domain -d scotus</pre><p
style="text-align: justify;">  The domain may take a while to create, sometimes up to 15 minutes.  Go grab a coffee or a beer and read your feed while you wait.  You can check the status either through the Management Console in your browser or with the following command:</p><pre>$ /opt/aws/cloud-search-tools/bin/cs-describe-domain -d scotus</pre><p
style="text-align: justify;">  Once this step is complete, you should see an ACTIVE domain with 0 documents.  We now need to reconfigure the access policies so that the domain allows us to submit documents from our own IP address (substitute it for IP_ADDRESS below) and allows anyone to search:</p><pre>$ /opt/aws/cloud-search-tools/bin/cs-configure-access-policies -d scotus --update --allow IP_ADDRESS --service doc
$ /opt/aws/cloud-search-tools/bin/cs-configure-access-policies -d scotus --update --allow all --service search</pre><p>This policy change may take a few minutes to go into effect.</p><p>Lastly, we need to tell the domain which fields to index for each document.</p><pre>$ /opt/aws/cloud-search-tools/bin/cs-configure-fields -d scotus --name title --type text --option result
$ /opt/aws/cloud-search-tools/bin/cs-configure-fields -d scotus --name content --type text --option result</pre><h2 style="text-align: justify;">Populating the Cloud Search Domain</h2><p
style="text-align: justify;">  OK, we&#8217;re ready to go!  At this point, we need to generate Search Data Format (SDF) files to populate the domain.  There are two approaches we can take:</p><ol><li
style="text-align: justify;">Write a parser to extract exactly the text content and metadata we want.</li><li>Throw the pre-packaged <strong>cs-generate-sdf</strong> utility at our data and hope for the best.</li></ol><p
style="text-align: justify;">  For brevity&#8217;s sake, we&#8217;ll pursue option 2.  After some poking around, I&#8217;ve found that <strong>cs-generate-sdf</strong> is based on a common open-source content extraction library &#8211; <a
title="Tika" href="http://tika.apache.org/" target="_blank">Apache Tika</a>.  You might be familiar with Tika, as it&#8217;s the guts behind <a
title="Solr" href="http://lucene.apache.org/solr/" target="_blank">Solr&#8217;s</a> ability to ingest unstructured data.  So if you&#8217;d be happy naively ingesting the content in Solr, you&#8217;ll probably be happy with the results that cs-generate-sdf produces.</p><p>  While we could build something more complex, let&#8217;s stick to bash here:</p><pre>$ for d in `find /data/courts/US/ -type d`;
do
  /opt/aws/cloud-search-tools/bin/cs-generate-sdf --source "$d/*.html" -d scotus;
done</pre><p
style="text-align: justify;">  A few things to note:</p><ul><li
style="text-align: justify;">If you see error messages like &#8220;Request forbidden by administrative rules&#8221; or &#8220;403 Forbidden&#8221;, your access policies have not taken effect or you provided the wrong IP for the document service.</li><li>You should see lots of lines go by; two for every file that is being parsed.</li><li>This step can be parallelized, but will almost certainly be disk-bound unless you are running on some kind of RAID or NAS setup that allows for concurrent reads.</li></ul><p
style="text-align: justify;">  This could take a while: it took about 45 minutes to generate and transmit on my i7 2600k/32GB RAM/SATA III SSD workstation.  You should grab another coffee or beer and watch a show.</p><p
style="text-align: justify;"><strong>  Another caveat</strong>: even after you&#8217;ve transmitted all data up to the cloud, it will still take some time for the Cloud Search instance to churn through the data and complete indexing.</p><h2>Searching the Cloud Search Domain</h2><p
style="text-align: justify;">  Once the Cloud Search instance is fully built, it&#8217;s time to figure out how to search.  The best way to do this is, sadly, to read <a
title="AWS Searching" href="http://docs.amazonwebservices.com/cloudsearch/latest/developerguide/searching.html" target="_blank">the developer documentation</a>.  However, if you want to skip the boring part, just try running something like this:</p><pre>$ curl 'http://search-scotus-domain_id.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q="clear%20and%20present%20danger"&amp;return-fields=title'</pre><p
style="text-align: justify;">  This search looks for an exact phrase match on &#8220;clear and present danger&#8221; and returns not only the document ID, but also the title property of the document.  You should get back something like this:</p><pre>{"rank":"-text_relevance","match-expr":"(label '"clear and present danger"')","hits":{"found":100,"start":0,"hit":[{"id":"d__data_courts_us_395_395_us_444_492_html","data":{"title":["395 U.S. 444"]}},{"id":"d__data_courts_us_343_343_us_946_326_html","data":{"title":["343 U.S. 946"]}},{"id":"d__data_courts_us_341_341_us_494_336_html","data":{"title":["341 U.S. 494"]}},{"id":"d__data_courts_us_370_370_us_375_369_html","data":{"title":["370 U.S. 375"]}},{"id":"d__data_courts_us_435_435_us_829_76_1450_html","data":{"title":["435 U.S. 829"]}},{"id":"d__data_courts_us_328_328_us_331_473_html","data":{"title":["328 U.S. 331"]}},{"id":"d__data_courts_us_360_360_us_924_488_html","data":{"title":["360 U.S. 924"]}},{"id":"d__data_courts_us_414_414_us_890_72_6629_html","data":{"title":["414 U.S. 890"]}},{"id":"d__data_courts_us_295_295_us_441_665_html","data":{"title":["295 U.S. 441"]}},{"id":"d__data_courts_us_331_331_us_367_241_html","data":{"title":["331 U.S. 367"]}}]},"info":{"rid":"90c9b0fdba3e834bd8a0834c12371bbbcbe700391fa33547ff19c86ee8af36004f16216852072604","time-ms":5,"cpu-time-ms":0}}</pre><p
style="text-align: justify;">  So, there it is! Your own fully searchable AWS Cloud Search domain for the Supreme Court. Not so bad after all, was it?</p> ]]></content:encoded> <wfw:commentRss>http://michaelbommarito.com/2012/04/15/building-an-aws-cloudsearch-domain-for-the-supreme-court/feed/</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>Visualization of Reading Level Frequency by Congressional Bill Stage</title><link>http://michaelbommarito.com/2012/04/15/visualization-of-reading-level-frequency-by-congressional-bill-stage/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=visualization-of-reading-level-frequency-by-congressional-bill-stage</link> <comments>http://michaelbommarito.com/2012/04/15/visualization-of-reading-level-frequency-by-congressional-bill-stage/#comments</comments> <pubDate>Sun, 15 Apr 2012 12:52:24 +0000</pubDate> <dc:creator>Michael J Bommarito II</dc:creator> <category><![CDATA[Law]]></category> <category><![CDATA[Programming]]></category> <category><![CDATA[Research]]></category> <category><![CDATA[computing]]></category> <category><![CDATA[data]]></category> <category><![CDATA[politics]]></category> <category><![CDATA[r]]></category> <category><![CDATA[visualization]]></category><guid
isPermaLink="false">http://www.michaelbommarito.com/blog/?p=635</guid> <description><![CDATA[  Here&#8217;s a fun example of how you might use my data on Congressional bill length and complexity.  Imagine you want to understand the empirical distribution of Flesch-Kincaid reading level for Congressional bills and how this distribution is related to<span
class="ellipsis">&#8230;</span> <a
href="http://michaelbommarito.com/2012/04/15/visualization-of-reading-level-frequency-by-congressional-bill-stage/"><div
class="read-more">Read more &#8250;</div></a>]]></description> <content:encoded><![CDATA[<p
style="text-align: justify;">  Here&#8217;s a fun example of how you might use <a
title="Updates to data and statistics on Congressional bill complexity" href="http://www.michaelbommarito.com/blog/2012/04/14/updates-to-data-and-statistics-on-congressional-bill-complexity/">my data on Congressional bill length and complexity</a>.  Imagine you want to understand the empirical distribution of Flesch-Kincaid reading level for Congressional bills and how this distribution is related to bill stage.  A first step might be to visualize this relationship.</p><p
style="text-align: justify;"><a
href="http://www.michaelbommarito.com/blog/wp-content/uploads/2012/04/reading_level_bill_stage_20120415.jpg"><img
class="aligncenter  wp-image-636" title="Reading Level, Bill Stage" src="http://www.michaelbommarito.com/blog/wp-content/uploads/2012/04/reading_level_bill_stage_20120415.jpg" alt="" width="700" /></a></p><p
style="text-align: justify;">  Based on this visualization, you might infer that engrossed bills tend to be less right-skewed and to have a lower mean reading level.  The story behind this might be that Senators and Representatives are less likely to accept legislation they do not understand.  To test this, you might run a one-sided, two-sample Kolmogorov-Smirnov (KS) test to see whether introduced bill reading levels tend to be greater than engrossed bill reading levels.</p><pre>&gt; ks.test(introduced, engrossed, alternative="less")

	Two-sample Kolmogorov-Smirnov test

data:  introduced and engrossed
D^- = 0.094, p-value = 0.006299
alternative hypothesis: the CDF of x lies below that of y</pre><p>Sample source below.<br
/><script src="https://gist.github.com/2392628.js?file=bill_complexity_example1.R"></script></p> ]]></content:encoded> <wfw:commentRss>http://michaelbommarito.com/2012/04/15/visualization-of-reading-level-frequency-by-congressional-bill-stage/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> </channel> </rss>
